Text classification under long-tailed distributions has become a pivotal research focus in machine learning and natural language processing. This section reviews methodologies closely aligned with the scope of this article, covering text classification algorithms and long-tail learning algorithms, and then provides an overview of commonly employed evaluation metrics.
Text classification algorithms
With the advancement of information technology and the advent of the big data era, natural language processing techniques are increasingly being applied across various domains and represent a current focal point of research.
Among the earlier algorithms are Naive Bayes (NB) [1], the K-Nearest Neighbor (KNN) algorithm [2], the Classification and Regression Tree (CART) algorithm [3], the C4.5 algorithm [4], the Support Vector Machine (SVM) [5], the Random Forests (RF) algorithm [6], and the Extreme Gradient Boosting (XGBoost) algorithm [7]. Compared with traditional rule-based methods, these shallow feature learning methods offer notable advantages in accuracy and stability, but they still require manual feature engineering and often overlook the natural sequential structure and contextual information of text, making it difficult to acquire semantic feature information.
Since the 2010s, text classification has undergone a discernible transition, evolving from methods reliant on shallow features toward deep learning frameworks. In contrast to shallow feature learning, deep learning methods obviate the need for manually curated rules and features and directly provide semantic representations for text mining.
Consequently, most contemporary work on text classification builds on deep neural network architectures, including Convolutional Neural Networks (CNN) [8], Recurrent Neural Networks (RNN) [9], Long Short-Term Memory (LSTM) networks [10], and Bidirectional Encoder Representations from Transformers (BERT) [11]. Compared with RNNs and LSTMs, CNNs offer strong parallel processing capability: the multiple convolutional kernels within a convolutional layer are mutually independent and can be computed concurrently. In 2014, Kim introduced the TextCNN model, which applies CNNs to encode n-gram features for sentence-level text classification. Known for its simple architecture, fast training, and solid accuracy, TextCNN is particularly well suited to short and medium-length texts. However, because of its small convolutional kernels, it cannot capture long-range semantic dependencies, rendering it ill-suited to long documents but appropriate for the short-text classification setting addressed in this paper.
DPCNN addresses the inherent limitation of TextCNN, which cannot capture long-range textual dependencies through convolution alone. By iteratively increasing network depth, DPCNN extracts long-range dependencies and achieves higher accuracy without imposing substantial additional computational overhead. BERT represents a pivotal milestone in the evolution of text classification and other natural language processing methodologies, showing superior performance across various NLP tasks, and many text classification studies have consequently been grounded in BERT [12–15]. ERNIE [16] leverages a richer corpus of Chinese language data and enhances the masking mechanism to align with the characteristics of Chinese, yielding notable improvements across diverse natural language processing tasks. Scholars have also increasingly explored text classification methods grounded in Graph Neural Networks (GNN) [16–19] to capture textual structure, and GPT (Generative Pre-trained Transformer) [19] stands out as another prominent direction. Collectively, deep-learning-based text classification is evolving rapidly, moving from idealized data settings toward pragmatic deployment in increasingly demanding real-world scenarios.
Alongside the evolution of information technology and the onset of the big data era, text classification has emerged as a critical component of natural language processing, providing a vital mechanism for managing the deluge of textual information. Recent years have seen rapid advances in deep learning (DL) and a series of remarkable results on text classification, such as ELMo (Embeddings from Language Models) [20], BERT [21], and network pruning [22]. Nevertheless, the efficacy of supervised deep learning hinges on both the quality and quantity of annotated samples, which poses challenges in real-world scenarios: first, annotating copious samples is arduous and open-ended, leaving deep learning models short of the labeled data they require; second, mining data from internet user-generated content is time-consuming. In contrast, humans can learn from only a few exemplars and can swiftly and accurately categorize novel instances. Classifying internet user-generated data with deep learning under scarce annotation has therefore become a primary research endeavor. When some classes have too few training samples, the data follows a long-tail distribution, a challenge this paper also seeks to address.
Learning algorithms for long-tailed distributions
Real-world datasets frequently exhibit severe class imbalance with a long-tailed distribution, posing significant challenges for deep recognition models. In classification tasks, a long-tailed distribution means the training data shows a large variance in instance counts across labels: head labels have abundant training instances yet account for only a small fraction of the label set, while tail labels have only a handful of training instances. Long-tailed learning algorithms fall mainly into rebalancing-based, transfer-learning-based, and few-shot-learning-based approaches. Rebalancing-based algorithms operate primarily at the data level, with researchers devising rebalancing strategies to correct skewed class distributions. Data rebalancing refines the sampling procedure so that instance counts across labels are approximately equal before classifier training, bringing tail and head labels closer to equilibrium; training the classifier on the rebalanced dataset then improves the classification of tail labels. Existing methods can be roughly classified into three categories: upsampling [25–27], downsampling [28, 29], and reweighting [31–34]. A representative upsampling method is the SMOTE (Synthetic Minority Oversampling TEchnique) algorithm [25], whose fundamental idea is to address the scarcity of tail-label instances by synthesizing new instances from the nearest neighbors of existing tail-label instances.
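The SMOTE interpolation step described above can be sketched in a few lines; this is a minimal illustration of the idea, not the full algorithm (which also performs nearest-neighbor search over the tail class):

```python
import random

def smote_sample(x, neighbor, rng=random):
    # SMOTE-style interpolation: synthesize a new tail-class instance on the
    # line segment between a real instance x and one of its nearest neighbors,
    # x_new = x + lam * (neighbor - x) with lam drawn uniformly from [0, 1).
    lam = rng.random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]
```

Because the synthetic point lies between two real tail instances, it stays inside the tail class's local region, though, as noted below, it can drift toward head-label regions when the chosen neighbor lies near a class boundary.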
Nonetheless, a drawback of this approach is that it may generate instances resembling those of the head labels, introducing noise and diminishing classification performance; on the whole, however, SMOTE outperforms random upsampling. Downsampling methods are the converse of upsampling: they eliminate head-label training instances to balance the instance counts between head and tail labels. The most straightforward approach randomly discards a portion of head-label instances, achieving data equilibrium but unavoidably losing crucial information from the head labels. Reweighting algorithms [31–34] tackle data imbalance by modifying the loss function used in model training, assigning greater weights to samples from the tail labels to furnish stronger supervision signals for tail-label learning. Nevertheless, such algorithms partially compromise feature learning and fail to fundamentally resolve the underlying issue. The Bilateral-Branch Network (BBN) employs a typical uniform sampler for learning general recognition patterns and a reverse sampler to model tail data, thereby refining the accuracy of tail-data recognition. Transfer-learning-based methods accumulate knowledge from the data-rich head labels and selectively transfer pertinent information to tail labels to improve their classification. The LEAP [36] algorithm enriches the distribution of tail labels by transferring the variance of head labels, thereby mitigating feature-space distortion.
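The reweighting idea can be illustrated with inverse-frequency class weights applied to a cross-entropy term; this is a generic sketch of the principle, not the exact loss of any cited work:

```python
import math
from collections import Counter

def class_weights(labels):
    # Inverse-frequency weighting: each class weight is inversely proportional
    # to its instance count, so tail classes receive larger weights.
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {c: total / (n_classes * cnt) for c, cnt in counts.items()}

def weighted_ce(prob_of_true, target, weights):
    # Cross-entropy for one sample, scaled by the weight of its true class.
    return -weights[target] * math.log(prob_of_true)
```

With 8 head-class and 2 tail-class samples, the tail class receives weight 2.5 versus 0.625 for the head class, so each tail sample contributes four times as strongly to the gradient.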
Specifically, this approach constructs a distinct feature cloud for each label, ensuring that tail labels, which tend to be homogeneous within a class, benefit from a broader distribution range. The study [38] recursively propagates acquired meta-knowledge from head labels to tail labels, employing dynamic meta-embedding to build a combined algorithm that bolsters the resilience of tail recognition by associating visual concepts between head- and tail-label embeddings. In short, transfer-learning-based long-tail algorithms extract pertinent knowledge from head labels and transfer it to tail labels to augment tail-label classification.
Low-shot and few-shot learning share similarities with long-tail learning, as all involve labels with varying instance counts, some with abundant instances and others with few. The objective of low-shot learning is to ensure the model fits well and performs optimally even with a limited number of instances [39–42]. Paper [39] synthesizes instances based on the head-label classifier and integrates them into the training of the tail-label model. Paper [40] introduces an attention-based model that uses a head-label classification-weight generator to craft a classifier that generalizes better to tail labels while retaining features learned from head-label training for tail-label adaptation. Paper [41] proposes a straightforward memory strategy in which the learner is first trained on head labels with ample instances before being extended to tail labels with fewer instances. Paper [42] presents prototypical networks, a metric-based learning technique that alleviates overfitting from sparse data by mapping instances into a metric space where instances of the same label are proximal and those of different labels are distant.
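The prototypical-network idea from [42] reduces to two operations, computing a per-class prototype and classifying by nearest prototype; the sketch below uses plain lists as embeddings and squared Euclidean distance as the metric:

```python
def prototype(instances):
    # Class prototype: the mean of the class's embedded instances.
    dim = len(instances[0])
    return [sum(v[i] for v in instances) / len(instances) for i in range(dim)]

def nearest_prototype(query, prototypes):
    # Assign the label whose prototype is closest to the query embedding
    # (squared Euclidean distance).
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(prototypes, key=lambda label: d2(query, prototypes[label]))
```

Because each class is summarized by a single mean vector, even a class with only a few instances yields a usable decision rule, which is why the approach mitigates overfitting under sparse data.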
POI text classification modeling process
Within the conventional supervised learning paradigm, the classification task undertaken in this study involves distinct, single labels, a setting in which many algorithms perform well. However, real-world entities diverge from such data: they often exhibit profoundly imbalanced distributions, which frequently constrains the applicability of deep recognition models in practice, since such models tend to favor dominant classes, performing well on head classes while showing pronounced deficiencies on tail classes. The data distribution used in this paper is illustrated in Fig. 1, and throughout the classification process the researchers adopted multi-level labeling, shown for selected categories in Table 1. Notably, many related label terms exist within the same category. Leveraging LEAP [36] to construct feature clouds may, to some extent, mitigate the low feature-matching rate of tail labels.
Fig. 1. POI data with long-tailed distribution
Table 1
Partial POI data double-layer labels and their encoding
| No. | Category | Subcategory | Type | Classification Code |
|---|---|---|---|---|
| 1 | Catering | Restaurants | Chinese restaurant | 110101 |
| 2 | Catering | Restaurants | Western restaurant | 110102 |
| 3 | Catering | Restaurants | Local flavors and famous stores | 110103 |
| 4 | Catering | Fast Food | Fast Food | 110201 |
| 5 | Catering | Casual Dining | Bar | 110301 |
| 6 | Catering | Casual Dining | Café | 110302 |
| 7 | Catering | Casual Dining | Tea House | 110303 |
| … | … | … | … | … |
Overall algorithm framework
The overall framework of the POI data classification algorithm proposed in this paper, depicted in Fig. 2, comprises three stages: data preprocessing, feature fusion, and post-processing. Pretrained models such as BERT, ERNIE, and TextCNN are employed for feature extraction and pretraining.
Data preprocessing: this phase handles the raw POI data and is of paramount importance. It encompasses re-encoding the classification codes from Table 1 and converting the data from Excel format to text format. Character standardization uniformly formats numerical, traditional, and abbreviated alphabetic characters and rectifies erroneous characters found in the text. Standardizing Arabic numerals to Chinese characters prevents ambiguity, while cleansing special characters from the data enhances data quality and improves classification outcomes.
Feature extraction: POI data typically comprises short texts of roughly 3 to 32 characters. Short texts contain limited feature information, and simplistic sentence parsing may overlook crucial details. The naming structure of POIs is distinctive, often embedding thematic information, known as feature-word information, within the name, and this information is predominantly situated toward the end of the name text. For instance, in "Home Inn Express Hotel" the feature word is "Hotel", in "Cloud Living Community Supermarket" it is "Supermarket", and in "Xinxin Big Pharmacy" it is "Pharmacy". Feeding all words of a POI name into the model indiscriminately, without appropriate weighting, can cause the model to capture this feature-word information inadequately. Therefore, this paper extracts key words from POI names and allocates more weight to words appearing toward the end, distinguishing them from the others.
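One simple way to realize the position-dependent weighting described above is a linearly increasing ramp over token positions; the function name, the linear form, and the `alpha` parameter are illustrative assumptions, not the paper's exact scheme:

```python
def positional_weights(tokens, alpha=0.5):
    # Hypothetical linearly increasing weights over token positions: later
    # tokens (where a POI's feature word such as "Hotel" or "Supermarket"
    # usually sits) receive larger normalized weights.
    raw = [1.0 + alpha * i for i in range(len(tokens))]
    total = sum(raw)
    return [w / total for w in raw]
```

For the four tokens of "Home Inn Express Hotel", the final token "Hotel" receives the largest share of the (normalized) weight mass.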
Post-processing: following feature extraction and fusion, post-processing techniques refine the classification results by cleansing noisy data, refining classification boundaries, and enhancing the overall accuracy of the classification model. These stages collectively constitute the framework of the POI data classification algorithm, guiding the process from raw data to refined classification outcomes. To address the imbalance introduced by long-tail data, this paper adopts weighted random sampling under the PyTorch framework to rebalance the POI data, yielding an improvement of approximately 2% in Precision, Recall, and F1-score.
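The weights fed to PyTorch's `WeightedRandomSampler` are typically the inverse class frequencies of the training labels; the pure-Python sketch below shows that weight computation and the resulting rebalanced draw (the paper's exact weighting is not specified, so inverse frequency is an assumption):

```python
import random
from collections import Counter

def per_sample_weights(labels):
    # Each sample's weight is the inverse of its class frequency, so every
    # class is drawn with roughly equal probability; this list is what would
    # be handed to torch.utils.data.WeightedRandomSampler in practice.
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

def resample(labels, k, seed=0):
    # Draw k indices with replacement according to the rebalancing weights.
    rng = random.Random(seed)
    return rng.choices(range(len(labels)),
                       weights=per_sample_weights(labels), k=k)
```

With 90 head-class and 10 tail-class samples, the tail class's total weight equals the head class's, so roughly half of the drawn batch comes from the tail class.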
Fig. 2. The overall framework of the POI data classification algorithm
BERT structure
BERT (Bidirectional Encoder Representations from Transformers), introduced by the Google team in 2018, is a large-scale pretrained model that achieved state-of-the-art results on 11 natural language tasks. It represents text with bidirectional Transformer encoders, integrating both preceding and succeeding context from multiple layers of the network into pretraining and thereby obtaining deep bidirectional word embeddings. The model structure is illustrated in the diagram below.
Fig. 3. BERT model structure diagram
The structure of the BERT model can be divided, from bottom to top, into an input layer, bidirectional Transformer layers, and an output layer. The input layer sums three vectors to form the model input: a token embedding for each token, a sentence (segment) embedding for the sentence as a whole, and a positional embedding recording positional information. The intermediate bidirectional Transformer layers, consisting of encoder layers with self-attention, primarily extract features from the text. The output layer backpropagates the loss and, after multiple training iterations, predicts the probability of the target word given its context and produces the output.
The core of the BERT model is its bidirectional Transformer architecture. The Transformer, developed by the Google team in 2017, is a natural language processing model based on attention mechanisms. Instead of extracting features through CNNs or RNNs, the Transformer applies the attention mechanism directly to the text, using attention weights to extract feature information. The Transformer consists mainly of an encoder and a decoder: the encoder transforms the input text sequence into vector representations recognizable by computers, while the decoder converts the encoder's output vectors back into a text sequence.
The multi-head self-attention layer in the encoder consists of multiple self-attention heads; in self-attention, the query, key, and value tensors are computed by multiplying the same input by different weight matrices. Each head can independently learn contextual information, enabling the multi-head self-attention layer to capture richer context. Self-attention is computed with the scaled dot-product attention formula:
$$Attention\left(Q,K,V\right)=Softmax\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$
Dividing \(QK^{T}\) by \(\sqrt{d_{k}}\) scales the scores to a suitable range, preventing excessively large or small values and thus avoiding saturated probabilities in the Softmax, which makes the model easier to train; \(d_{k}\) denotes the dimensionality of the embedding. Softmax converts the scores into probabilities, with the formula:
$$Softmax\left(x_{i}\right)=\frac{exp\left(x_{i}\right)}{\sum_{j=1}^{N}exp\left(x_{j}\right)}$$
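The two formulas above can be sketched in pure Python, with nested lists standing in for tensors; this is a minimal single-head illustration, not an optimized implementation:

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result of the formula above.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    # Softmax(QK^T / sqrt(d_k)) V, computed row by row over the queries.
    d_k = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wj * V[j][i] for j, wj in enumerate(w))
                    for i in range(len(V[0]))])
    return out
```

When all key rows are identical, the attention weights are uniform and the output is simply the mean of the value rows, a useful sanity check.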
The formula for the normalization layer is
$$LayerNorm\left(x\right)=\gamma \frac{x-\mu }{\sqrt{{\sigma }^{2}+ϵ}}+\beta$$
Here, \(x\) represents the input tensor; \(\gamma\) and \(\beta\) denote the scaling and offset factors; \(\mu\) and \(\sigma^{2}\) represent the mean and variance computed along the last dimension; and \(\epsilon\) keeps the denominator positive to prevent division by zero.
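The normalization formula above maps directly to a few lines of code; this sketch operates on a single vector (the last dimension) with scalar \(\gamma\) and \(\beta\):

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize over the last dimension, then scale by gamma and shift by beta,
    # exactly as in LayerNorm(x) = gamma * (x - mu) / sqrt(var + eps) + beta.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in x]
```

With the default \(\gamma = 1\) and \(\beta = 0\), the output has approximately zero mean and unit variance.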
The loss function consists of the Masked Language Model (MLM) loss and the Next Sentence Prediction (NSP) loss. Both are calculated using cross-entropy, but their specific calculation methods differ. The MLM loss is the loss incurred when the model predicts masked words. Specifically, for each input word the model outputs a probability distribution over the vocabulary, and the cross-entropy loss is calculated from the true word and the predicted distribution. Since only 15% of the words are masked, only the loss at these positions is computed, while the losses of the other words are ignored. Finally, the model averages the losses of all masked words to obtain the MLM loss, which can be represented by the following formula:
$$L_{MLM}=-\frac{1}{N}\sum_{i=1}^{N}\log P\left(w_{i}|C_{i}\right)$$
Here, \(N\) represents the count of masked words, \(w_{i}\) denotes the \(i\)-th masked word, \(C_{i}\) stands for the corresponding context, and \(P\left(w_{i}|C_{i}\right)\) signifies the probability predicted by the model.
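Given the probabilities the model assigns to the true masked words, the MLM formula above reduces to an average negative log-likelihood over the masked positions only:

```python
import math

def mlm_loss(true_word_probs):
    # true_word_probs[i] is P(w_i | C_i) for the i-th masked word; the loss is
    # the mean negative log-likelihood over the N masked positions, matching
    # L_MLM = -(1/N) * sum_i log P(w_i | C_i).
    return -sum(math.log(p) for p in true_word_probs) / len(true_word_probs)
```

A perfectly confident model (all probabilities 1) gives zero loss, and a probability of \(e^{-2}\) on a single masked word gives a loss of exactly 2.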
The Next Sentence Prediction (NSP) loss is calculated using the following formula:
$$L_{NSP}=-\frac{1}{M}\sum_{j=1}^{M}\log P\left(y_{j}|S_{j}\right)$$
Here, \(M\) represents the number of sentence pairs, \(y_{j}\) the true label of the \(j\)-th sentence pair (0 indicates NotNext, 1 indicates IsNext), \(S_{j}\) the \(j\)-th sentence pair, and \(P\left(y_{j}|S_{j}\right)\) the probability predicted by the model. The total loss of BERT is the weighted sum of the MLM loss and the NSP loss, represented by the following formula:
$$L_{BERT}=L_{MLM}+\lambda L_{NSP}$$
Real POI data based on city geographic coordinates exhibits a characteristic long-tailed distribution. This paper proposes a dual-layer label perception algorithm for POI data that utilizes manually categorized secondary labels. To address insufficiently clear classification of POI data, the manually collected secondary labels are used to construct a feature cloud for the primary classification labels. Secondary labels typically represent finer-grained classifications of the primary labels and can therefore match the data better. The method borrows from multi-label classification principles to tackle the limited accuracy of single-layer label classification. The algorithm first classifies the data using the secondary labels, then averages the predicted confidences of secondary classes belonging to the same primary category and compares the result with the predicted confidence of the primary category. If the primary classification confidence is low, the data's predicted label is updated to that of the secondary labels. For long-tail data with too few labeled instances, the feature cloud's broader distribution range improves the recognition accuracy that would otherwise suffer from the scarcity of tail labels.
Algorithm 1: Dual-layer Label Perception Algorithm for POI Data

Input: POI data, primary labels, secondary labels
Output: predicted label \(\widehat{Y}\)
1. Train on the primary classification labels following the framework in Fig. 2 to obtain the initial predicted label \(\widehat{Y}\) with classification confidence \({\widehat{\gamma }}_{1}\)
2. Train the model on the secondary classification labels following the framework in Fig. 2, and compute the secondary prediction and its confidence \({\widehat{\gamma }}_{2}\) by averaging the weighted classification thresholds of the secondary labels
3. for each POI data item:
4.   if the primary-label classification confidence \({\widehat{\gamma }}_{1}\) < 0.6:
5.     if \({\widehat{\gamma }}_{2}-{\widehat{\gamma }}_{1}\) > 0.1:
6.       assign the secondary-label prediction as the predicted label \(\widehat{Y}\)
7.       update the label to the primary label to which the secondary label belongs
8. end for
9. return the new predicted label \(\widehat{Y}\)
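The decision rule at the core of Algorithm 1 can be sketched as follows; function and variable names are illustrative, while the thresholds 0.6 and 0.1 follow the algorithm above:

```python
def fuse_labels(primary_pred, primary_conf, secondary_pred, secondary_conf,
                parent_of, tau=0.6, margin=0.1):
    # Dual-layer decision rule: when the primary classifier is unconfident
    # (confidence below tau) and the secondary classifier beats it by more
    # than margin, adopt the parent category of the predicted secondary label;
    # otherwise keep the primary prediction.
    if primary_conf < tau and secondary_conf - primary_conf > margin:
        return parent_of[secondary_pred]
    return primary_pred
```

For example, an unconfident primary prediction of "shopping" (0.4) is overridden when the secondary classifier confidently (0.7) predicts "Chinese restaurant", whose parent category is "catering".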