Text classification under long-tailed distributions has become a pivotal research focus in machine learning and natural language processing. This section reviews methodologies closely aligned with the scope of this article, covering text classification algorithms and long-tail learning algorithms, and then provides an overview of commonly employed evaluation metrics.
Text classification algorithms
With the advancement of information technology and the advent of the big data era, natural language processing techniques are increasingly being applied across various domains and represent a current focal point of research.
Among the earlier algorithms are Naive Bayes (NB) [1], the K-Nearest Neighbor (KNN) algorithm [2], the Classification and Regression Tree (CART) algorithm [3], the C4.5 algorithm [4], the Support Vector Machine (SVM) [5], the Random Forests (RF) algorithm [6], and the Extreme Gradient Boosting (XGBoost) algorithm [7]. Compared with traditional rule-based methods, these shallow feature learning methods offer notable advantages in accuracy and stability, but they still require manual feature engineering and often overlook the natural sequential structure and contextual information of text, making it difficult to acquire semantic feature information.
Since the 2010s, text classification has undergone a discernible transition, evolving from methods reliant on shallow features toward deep learning frameworks. In contrast to shallow feature learning, deep learning methods obviate the need for manually curated rules and features and directly provide semantic representations for text mining.
Consequently, most contemporary work on text classification builds on deep neural network architectures, including Convolutional Neural Networks (CNN) [8], Recurrent Neural Networks (RNN) [9], Long Short-Term Memory (LSTM) networks [10], and Bidirectional Encoder Representations from Transformers (BERT) [11]. Compared with RNNs and LSTMs, CNNs offer strong parallel processing capability: the multiple convolutional kernels within a convolutional layer are mutually independent and can be computed concurrently. In 2014, Kim introduced the TextCNN model, which applies CNNs to encode n-gram features for sentence-level text classification. Known for its simple architecture, fast training, and solid accuracy, TextCNN is particularly well suited to short and medium-length texts. However, because of its small convolutional kernels, it cannot capture long-range semantic dependencies, rendering it ill-suited to long documents but appropriate for the short-text classification setting addressed in this paper.
DPCNN addresses the inherent limitation of TextCNN, which cannot capture long-range textual dependencies through convolution alone. By iteratively increasing network depth, DPCNN extracts long-range dependencies and achieves higher accuracy without imposing substantial additional computational overhead. BERT represents a pivotal milestone in the evolution of text classification and other natural language processing methodologies, showing superior performance across various NLP tasks, and many text classification studies have consequently been grounded in BERT [12–15]. ERNIE [16] leverages a richer corpus of Chinese language data and enhances the masking mechanism to align with the characteristics of Chinese, yielding notable improvements across diverse natural language processing tasks. Scholars have also increasingly explored text classification methods grounded in Graph Neural Networks (GNN) [16–19] to capture textual structure, and GPT (Generative Pre-trained Transformer) [19] stands out as another prominent direction. Collectively, deep-learning-based text classification is evolving rapidly, moving from idealized data settings toward pragmatic deployment in increasingly demanding real-world scenarios.
Alongside the evolution of information technology and the onset of the big data era, text classification has emerged as a critical component of natural language processing, providing a vital mechanism for managing the deluge of textual information. Recent years have seen rapid advances in deep learning (DL) and a series of remarkable results on text classification, such as ELMo (Embeddings from Language Models) [20], BERT [21], and network pruning [22]. Nevertheless, the efficacy of supervised deep learning hinges on both the quality and quantity of annotated samples, which poses challenges in real-world scenarios: first, annotating copious samples is arduous and open-ended, leaving deep learning models short of the labeled data they require; second, mining data from internet user-generated content is time-consuming. In contrast, humans can learn from only a few exemplars and can swiftly and accurately categorize novel instances. Classifying internet user-generated data with deep learning under scarce annotation has therefore become a primary research endeavor. When some classes have too few training samples, the data follows a long-tail distribution, a challenge this paper also seeks to address.
Learning algorithms for long-tailed distributions
Real-world datasets frequently exhibit severe class imbalance with a long-tailed distribution, posing significant challenges for deep recognition models. In classification tasks, a long-tailed distribution means the training data shows a large variance in instance counts across labels: head labels have abundant training instances yet account for only a small fraction of the label set, while tail labels have only a handful of training instances. Long-tailed learning algorithms fall mainly into rebalancing-based, transfer-learning-based, and few-shot-learning-based approaches. Rebalancing-based algorithms operate primarily at the data level, with researchers devising rebalancing strategies to correct skewed class distributions. Data rebalancing refines the sampling procedure so that instance counts across labels are approximately equal before classifier training, bringing tail and head labels closer to equilibrium; training the classifier on the rebalanced dataset then improves the classification of tail labels. Existing methods can be roughly classified into three categories: upsampling [25–27], downsampling [28, 29], and reweighting [31–34]. A representative upsampling method is the SMOTE (Synthetic Minority Oversampling TEchnique) algorithm [25], whose fundamental idea is to address the scarcity of tail-label instances by synthesizing new instances from the nearest neighbors of existing tail-label instances.
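The SMOTE interpolation step described above can be sketched in a few lines; this is a minimal illustration of the idea, not the full algorithm (which also performs nearest-neighbor search over the tail class):

```python
import random

def smote_sample(x, neighbor, rng=random):
    # SMOTE-style interpolation: synthesize a new tail-class instance on the
    # line segment between a real instance x and one of its nearest neighbors,
    # x_new = x + lam * (neighbor - x) with lam drawn uniformly from [0, 1).
    lam = rng.random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]
```

Because the synthetic point lies between two real tail instances, it stays inside the tail class's local region, though, as noted below, it can drift toward head-label regions when the chosen neighbor lies near a class boundary.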
Nonetheless, a drawback of this approach is that it may generate instances resembling those of the head labels, introducing noise and diminishing classification performance; on the whole, however, SMOTE outperforms random upsampling. Downsampling methods are the converse of upsampling: they eliminate head-label training instances to balance the instance counts between head and tail labels. The most straightforward approach randomly discards a portion of head-label instances, achieving data equilibrium but unavoidably losing crucial information from the head labels. Reweighting algorithms [31–34] tackle data imbalance by modifying the loss function used in model training, assigning greater weights to samples from the tail labels to furnish stronger supervision signals for tail-label learning. Nevertheless, such algorithms partially compromise feature learning and fail to fundamentally resolve the underlying issue. The Bilateral-Branch Network (BBN) employs a typical uniform sampler for learning general recognition patterns and a reverse sampler to model tail data, thereby refining the accuracy of tail-data recognition. Transfer-learning-based methods accumulate knowledge from the data-rich head labels and selectively transfer pertinent information to tail labels to improve their classification. The LEAP [36] algorithm enriches the distribution of tail labels by transferring the variance of head labels, thereby mitigating feature-space distortion.
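The reweighting idea can be illustrated with inverse-frequency class weights applied to a cross-entropy term; this is a generic sketch of the principle, not the exact loss of any cited work:

```python
import math
from collections import Counter

def class_weights(labels):
    # Inverse-frequency weighting: each class weight is inversely proportional
    # to its instance count, so tail classes receive larger weights.
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {c: total / (n_classes * cnt) for c, cnt in counts.items()}

def weighted_ce(prob_of_true, target, weights):
    # Cross-entropy for one sample, scaled by the weight of its true class.
    return -weights[target] * math.log(prob_of_true)
```

With 8 head-class and 2 tail-class samples, the tail class receives weight 2.5 versus 0.625 for the head class, so each tail sample contributes four times as strongly to the gradient.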
Specifically, this approach constructs a distinct feature cloud for each label, ensuring that tail labels, which tend to be homogeneous within a class, benefit from a broader distribution range. The study [38] recursively propagates acquired meta-knowledge from head labels to tail labels, employing dynamic meta-embedding to build a combined algorithm that bolsters the resilience of tail recognition by associating visual concepts between head- and tail-label embeddings. In short, transfer-learning-based long-tail algorithms extract pertinent knowledge from head labels and transfer it to tail labels to augment tail-label classification.
Low-shot and few-shot learning share similarities with long-tail learning, as all involve labels with varying instance counts, some with abundant instances and others with few. The objective of low-shot learning is to ensure the model fits well and performs optimally even with a limited number of instances [39–42]. Paper [39] synthesizes instances based on the head-label classifier and integrates them into the training of the tail-label model. Paper [40] introduces an attention-based model that uses a head-label classification-weight generator to craft a classifier that generalizes better to tail labels while retaining features learned from head-label training for tail-label adaptation. Paper [41] proposes a straightforward memory strategy in which the learner is first trained on head labels with ample instances before being extended to tail labels with fewer instances. Paper [42] presents prototypical networks, a metric-based learning technique that alleviates overfitting from sparse data by mapping instances into a metric space where instances of the same label are proximal and those of different labels are distant.
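The prototypical-network idea from [42] reduces to two operations, computing a per-class prototype and classifying by nearest prototype; the sketch below uses plain lists as embeddings and squared Euclidean distance as the metric:

```python
def prototype(instances):
    # Class prototype: the mean of the class's embedded instances.
    dim = len(instances[0])
    return [sum(v[i] for v in instances) / len(instances) for i in range(dim)]

def nearest_prototype(query, prototypes):
    # Assign the label whose prototype is closest to the query embedding
    # (squared Euclidean distance).
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(prototypes, key=lambda label: d2(query, prototypes[label]))
```

Because each class is summarized by a single mean vector, even a class with only a few instances yields a usable decision rule, which is why the approach mitigates overfitting under sparse data.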
POI text classification modeling process
Within the conventional supervised learning paradigm, the classification task undertaken in this study involves distinct, single labels, a setting in which many algorithms perform well. However, real-world entities diverge from such data: they often exhibit profoundly imbalanced distributions, which frequently constrains the applicability of deep recognition models in practice, since such models tend to favor dominant classes, performing well on head classes while showing pronounced deficiencies on tail classes. The data distribution used in this paper is illustrated in Fig. 1, and throughout the classification process the researchers adopted multi-level labeling, shown for selected categories in Table 1. Notably, many related label terms exist within the same category. Leveraging LEAP [36] to construct feature clouds may, to some extent, mitigate the low feature-matching rate of tail labels.
Fig. 1. POI data with long-tailed distribution
Table 1
Partial POI data double-layer labels and their encoding
| No. | Category | Subcategory | Type | Classification Code |
|---|---|---|---|---|
| 1 | Catering | Restaurants | Chinese restaurant | 110101 |
| 2 | Catering | Restaurants | Western restaurant | 110102 |
| 3 | Catering | Restaurants | Local flavors and famous stores | 110103 |
| 4 | Catering | Fast Food | Fast Food | 110201 |
| 5 | Catering | Casual Dining | Bar | 110301 |
| 6 | Catering | Casual Dining | Café | 110302 |
| 7 | Catering | Casual Dining | Tea House | 110303 |
| … | … | … | … | … |
Overall algorithm framework
The overall framework of the POI data classification algorithm proposed in this paper, depicted in Fig. 2, comprises three stages: data preprocessing, feature fusion, and post-processing. Pretrained models such as BERT, ERNIE, and TextCNN are employed for feature extraction and pretraining.
Data preprocessing: this phase handles the raw POI data and is of paramount importance. It encompasses re-encoding the classification codes from Table 1 and converting the data from Excel format to text format. Character standardization uniformly formats numerical, traditional, and abbreviated alphabetic characters and rectifies erroneous characters found in the text. Standardizing Arabic numerals to Chinese characters prevents ambiguity, while cleansing special characters from the data enhances data quality and improves classification outcomes.
Feature extraction: POI data typically comprises short texts of roughly 3 to 32 characters. Short texts contain limited feature information, and simplistic sentence parsing may overlook crucial details. The naming structure of POIs is distinctive, often embedding thematic information, known as feature-word information, within the name, and this information is predominantly situated toward the end of the name text. For instance, in "Home Inn Express Hotel" the feature word is "Hotel", in "Cloud Living Community Supermarket" it is "Supermarket", and in "Xinxin Big Pharmacy" it is "Pharmacy". Feeding all words of a POI name into the model indiscriminately, without appropriate weighting, can cause the model to capture this feature-word information inadequately. Therefore, this paper extracts key words from POI names and allocates more weight to words appearing toward the end, distinguishing them from the others.
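One simple way to realize the position-dependent weighting described above is a linearly increasing ramp over token positions; the function name, the linear form, and the `alpha` parameter are illustrative assumptions, not the paper's exact scheme:

```python
def positional_weights(tokens, alpha=0.5):
    # Hypothetical linearly increasing weights over token positions: later
    # tokens (where a POI's feature word such as "Hotel" or "Supermarket"
    # usually sits) receive larger normalized weights.
    raw = [1.0 + alpha * i for i in range(len(tokens))]
    total = sum(raw)
    return [w / total for w in raw]
```

For the four tokens of "Home Inn Express Hotel", the final token "Hotel" receives the largest share of the (normalized) weight mass.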
Post-processing: following feature extraction and fusion, post-processing techniques refine the classification results by cleansing noisy data, refining classification boundaries, and enhancing the overall accuracy of the classification model. These stages collectively constitute the framework of the POI data classification algorithm, guiding the process from raw data to refined classification outcomes. To address the imbalance introduced by long-tail data, this paper adopts weighted random sampling under the PyTorch framework to rebalance the POI data, yielding an improvement of approximately 2% in Precision, Recall, and F1-score.
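The weights fed to PyTorch's `WeightedRandomSampler` are typically the inverse class frequencies of the training labels; the pure-Python sketch below shows that weight computation and the resulting rebalanced draw (the paper's exact weighting is not specified, so inverse frequency is an assumption):

```python
import random
from collections import Counter

def per_sample_weights(labels):
    # Each sample's weight is the inverse of its class frequency, so every
    # class is drawn with roughly equal probability; this list is what would
    # be handed to torch.utils.data.WeightedRandomSampler in practice.
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

def resample(labels, k, seed=0):
    # Draw k indices with replacement according to the rebalancing weights.
    rng = random.Random(seed)
    return rng.choices(range(len(labels)),
                       weights=per_sample_weights(labels), k=k)
```

With 90 head-class and 10 tail-class samples, the tail class's total weight equals the head class's, so roughly half of the drawn batch comes from the tail class.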
Fig. 2. The overall framework of the POI data classification algorithm
BERT structure
BERT (Bidirectional Encoder Representations from Transformers), introduced by the Google team in 2018, is a large-scale pretrained model that achieved state-of-the-art results on 11 natural language tasks. It represents text with bidirectional Transformer encoders, integrating both preceding and succeeding context from multiple layers of the network into pretraining and thereby obtaining deep bidirectional word embeddings. The model structure is illustrated in the diagram below.
Fig. 3. BERT model structure diagram
The structure of the BERT model can be divided, from bottom to top, into an input layer, bidirectional Transformer layers, and an output layer. The input layer sums three vectors to form the model input: a token embedding for each token, a sentence (segment) embedding for the sentence as a whole, and a positional embedding recording positional information. The intermediate bidirectional Transformer layers, consisting of encoder layers with self-attention, primarily extract features from the text. The output layer backpropagates the loss and, after multiple training iterations, predicts the probability of the target word given its context and produces the output.
The core of the BERT model is its bidirectional Transformer architecture. The Transformer, developed by the Google team in 2017, is a natural language processing model based on attention mechanisms. Instead of extracting features through CNNs or RNNs, the Transformer applies the attention mechanism directly to the text, using attention weights to extract feature information. The Transformer consists mainly of an encoder and a decoder: the encoder transforms the input text sequence into vector representations recognizable by computers, while the decoder converts the encoder's output vectors back into a text sequence.
The multi-head self-attention layer in the encoder consists of multiple self-attention heads; in self-attention, the query, key, and value tensors are computed by multiplying the same input by different weight matrices. Each head can independently learn contextual information, enabling the multi-head self-attention layer to capture richer context. Self-attention is computed with the scaled dot-product attention formula:
$$Attention\left(Q,K,V\right)=Softmax\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$
Dividing \(QK^{T}\) by \(\sqrt{d_{k}}\) scales the scores to a suitable range, preventing excessively large or small values and thus avoiding saturated probabilities in the Softmax, which makes the model easier to train; \(d_{k}\) denotes the dimensionality of the embedding. Softmax converts the scores into probabilities, with the formula:
$$Softmax\left(x_{i}\right)=\frac{exp\left(x_{i}\right)}{\sum_{j=1}^{N}exp\left(x_{j}\right)}$$
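The two formulas above can be sketched in pure Python, with nested lists standing in for tensors; this is a minimal single-head illustration, not an optimized implementation:

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result of the formula above.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    # Softmax(QK^T / sqrt(d_k)) V, computed row by row over the queries.
    d_k = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wj * V[j][i] for j, wj in enumerate(w))
                    for i in range(len(V[0]))])
    return out
```

When all key rows are identical, the attention weights are uniform and the output is simply the mean of the value rows, a useful sanity check.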
The formula for the normalization layer is
$$LayerNorm\left(x\right)=\gamma \frac{x-\mu }{\sqrt{{\sigma }^{2}+ϵ}}+\beta$$
Here, \(x\) represents the input tensor; \(\gamma\) and \(\beta\) denote the scaling and offset factors; \(\mu\) and \(\sigma^{2}\) represent the mean and variance computed along the last dimension; and \(\epsilon\) keeps the denominator positive to prevent division by zero.
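The normalization formula above maps directly to a few lines of code; this sketch operates on a single vector (the last dimension) with scalar \(\gamma\) and \(\beta\):

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize over the last dimension, then scale by gamma and shift by beta,
    # exactly as in LayerNorm(x) = gamma * (x - mu) / sqrt(var + eps) + beta.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in x]
```

With the default \(\gamma = 1\) and \(\beta = 0\), the output has approximately zero mean and unit variance.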
The loss function consists of the Masked Language Model (MLM) loss and the Next Sentence Prediction (NSP) loss. Both are calculated using cross-entropy, but their specific calculation methods differ. The MLM loss is the loss incurred when the model predicts masked words. Specifically, for each input word the model outputs a probability distribution over the vocabulary, and the cross-entropy loss is calculated from the true word and the predicted distribution. Since only 15% of the words are masked, only the loss at these positions is computed, while the losses of the other words are ignored. Finally, the model averages the losses of all masked words to obtain the MLM loss, which can be represented by the following formula:
$$L_{MLM}=-\frac{1}{N}\sum_{i=1}^{N}\log P\left(w_{i}|C_{i}\right)$$
Here, \(N\) represents the count of masked words, \(w_{i}\) denotes the \(i\)-th masked word, \(C_{i}\) stands for the corresponding context, and \(P\left(w_{i}|C_{i}\right)\) signifies the probability predicted by the model.
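Given the probabilities the model assigns to the true masked words, the MLM formula above reduces to an average negative log-likelihood over the masked positions only:

```python
import math

def mlm_loss(true_word_probs):
    # true_word_probs[i] is P(w_i | C_i) for the i-th masked word; the loss is
    # the mean negative log-likelihood over the N masked positions, matching
    # L_MLM = -(1/N) * sum_i log P(w_i | C_i).
    return -sum(math.log(p) for p in true_word_probs) / len(true_word_probs)
```

A perfectly confident model (all probabilities 1) gives zero loss, and a probability of \(e^{-2}\) on a single masked word gives a loss of exactly 2.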
The Next Sentence Prediction (NSP) loss is calculated using the following formula:
$$L_{NSP}=-\frac{1}{M}\sum_{j=1}^{M}\log P\left(y_{j}|S_{j}\right)$$
Here, \(M\) represents the number of sentence pairs, \(y_{j}\) the true label of the \(j\)-th sentence pair (0 indicates NotNext, 1 indicates IsNext), \(S_{j}\) the \(j\)-th sentence pair, and \(P\left(y_{j}|S_{j}\right)\) the probability predicted by the model. The total loss of BERT is the weighted sum of the MLM loss and the NSP loss, represented by the following formula:
$$L_{BERT}=L_{MLM}+\lambda L_{NSP}$$
Real POI data based on city geographic coordinates exhibits a characteristic long-tailed distribution. This paper proposes a dual-layer label perception algorithm for POI data that utilizes manually categorized secondary labels. To address insufficiently clear classification of POI data, the manually collected secondary labels are used to construct a feature cloud for the primary classification labels. Secondary labels typically represent finer-grained classifications of the primary labels and can therefore match the data better. The method borrows from multi-label classification principles to tackle the limited accuracy of single-layer label classification. The algorithm first classifies the data using the secondary labels, then averages the predicted confidences of secondary classes belonging to the same primary category and compares the result with the predicted confidence of the primary category. If the primary classification confidence is low, the data's predicted label is updated to that of the secondary labels. For long-tail data with too few labeled instances, the feature cloud's broader distribution range improves the recognition accuracy that would otherwise suffer from the scarcity of tail labels.
Algorithm 1: Dual-layer Label Perception Algorithm for POI Data

Input: POI data, primary labels, secondary labels
Output: predicted label \(\widehat{Y}\)
1. Train on the primary classification labels following the framework in Fig. 2 to obtain the initial predicted label \(\widehat{Y}\) with classification confidence \({\widehat{\gamma }}_{1}\)
2. Train the model on the secondary classification labels following the framework in Fig. 2, and compute the secondary prediction and its confidence \({\widehat{\gamma }}_{2}\) by averaging the weighted classification thresholds of the secondary labels
3. for each POI data item:
4.   if the primary-label classification confidence \({\widehat{\gamma }}_{1}\) < 0.6:
5.     if \({\widehat{\gamma }}_{2}-{\widehat{\gamma }}_{1}\) > 0.1:
6.       assign the secondary-label prediction as the predicted label \(\widehat{Y}\)
7.       update the label to the primary label to which the secondary label belongs
8. end for
9. return the new predicted label \(\widehat{Y}\)
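The decision rule at the core of Algorithm 1 can be sketched as follows; function and variable names are illustrative, while the thresholds 0.6 and 0.1 follow the algorithm above:

```python
def fuse_labels(primary_pred, primary_conf, secondary_pred, secondary_conf,
                parent_of, tau=0.6, margin=0.1):
    # Dual-layer decision rule: when the primary classifier is unconfident
    # (confidence below tau) and the secondary classifier beats it by more
    # than margin, adopt the parent category of the predicted secondary label;
    # otherwise keep the primary prediction.
    if primary_conf < tau and secondary_conf - primary_conf > margin:
        return parent_of[secondary_pred]
    return primary_pred
```

For example, an unconfident primary prediction of "shopping" (0.4) is overridden when the secondary classifier confidently (0.7) predicts "Chinese restaurant", whose parent category is "catering".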