In recent years, scholars at home and abroad have applied deep learning, data mining, natural language processing, knowledge graphs, and other technologies to classify, mine, and analyze hazard text data, and have improved the accuracy and effectiveness of such analysis through ensemble learning, transfer learning, and reinforcement learning. Text classification is a fundamental task in natural language processing (NLP), aiming to assign a piece of text to one or more predefined categories4.
With the rapid development of natural language processing technology, deep learning models built on word vectors, such as GloVe, FastText, ELMo5, and Transformer-based models, have been widely applied to short text classification across various fields. Maria Alejandra6 et al. proposed an ELMo-based text classification model that obtains contextual word vectors from the pre-trained ELMo model, extracts local features from them, and makes the final prediction through a classification layer. Experimental results show that this method significantly outperforms traditional word embedding methods in classification accuracy.
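To make the described pipeline concrete, the following is a minimal PyTorch sketch of a classifier of this kind. It assumes contextual word vectors have already been produced by a pre-trained ELMo model (ELMo's default output dimension is 1024); the choice of a 1-D convolution as the local feature extractor and all dimensions are illustrative assumptions, not the cited authors' configuration.

import torch
import torch.nn as nn

class ElmoStyleClassifier(nn.Module):
    """Illustrative classifier over pre-computed contextual word vectors
    (e.g., from a pre-trained ELMo model): a 1-D convolution extracts
    local n-gram features, max-pooling summarizes the sequence, and a
    linear layer makes the final prediction."""
    def __init__(self, embed_dim=1024, num_filters=128, kernel_size=3, num_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=1)
        self.classifier = nn.Linear(num_filters, num_classes)

    def forward(self, contextual_vectors):          # (batch, seq_len, embed_dim)
        x = contextual_vectors.transpose(1, 2)      # Conv1d expects (batch, dim, seq_len)
        x = torch.relu(self.conv(x))                # local features over n-gram windows
        x = x.max(dim=2).values                     # global max-pool over time
        return self.classifier(x)                   # class logits

# Usage with dummy ELMo-sized vectors (batch of 4 sentences, 20 tokens each)
logits = ElmoStyleClassifier()(torch.randn(4, 20, 1024))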
However, convolutional neural networks have limitations in extracting semantic features7, and many scholars have therefore turned to multi-model fusion methods for text classification. Jongga Lee8 investigated the impact of regularization on text classification models trained with limited labeled data, comparing a simple word embedding-based model with more complex CNN and BiLSTM models. Adversarial training improved supervised learning, while semi-supervised methods (the Pi model and virtual adversarial training) further improved performance by exploiting unlabeled data. Evaluated on four datasets (AG News, DBpedia, Yahoo! Answers, and Yelp Polarity), both simple and complex models benefited from regularization, with the complex models showing the largest improvements.
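As an illustration of the adversarial regularization compared in that study, the following is a minimal sketch of embedding-level adversarial training (an FGM-style perturbation in the spirit of Miyato et al.); the model is assumed to accept embeddings directly, and the epsilon value and helper name are hypothetical.

import torch

def adversarial_loss(model, embeddings, labels, loss_fn, epsilon=1.0):
    """FGM-style adversarial regularization (illustrative): perturb the word
    embeddings along the gradient of the clean loss, then add the loss on
    the perturbed input to the clean loss."""
    embeddings = embeddings.detach().requires_grad_(True)
    clean_loss = loss_fn(model(embeddings), labels)
    # Gradient of the loss w.r.t. the embeddings; keep the graph so the
    # combined loss can still be backpropagated through the model.
    grad, = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)
    perturbation = epsilon * grad / (grad.norm() + 1e-8)
    adv_loss = loss_fn(model(embeddings + perturbation), labels)
    return clean_loss + adv_loss

# Usage (hypothetical): model maps (batch, seq, dim) embeddings to logits.
# total = adversarial_loss(model, embeds, labels, torch.nn.functional.cross_entropy)
# total.backward(); optimizer.step()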
Bidirectional Encoder Representations from Transformers (BERT) is a pre-training technique for NLP proposed by Google in 20189. The initial English release provided two pre-trained models, BERT-Base and BERT-Large10. The core of BERT is a Transformer, with configurable numbers of encoder layers and self-attention heads. Hao Wang11 et al. developed an efficient AI-generated text detection model based on the BERT algorithm, preprocessing text by converting it to lowercase, splitting it into words, and removing stop words. Trained and tested on a 60/40 dataset split, the model's accuracy rose from 94.78% to 99.72% while its loss fell from 0.261 to 0.021. The average training-set loss was 0.0565 and the test-set loss 0.0917; average accuracies were 98.1% on the training set and 97.71% on the test set, indicating good generalization. This BERT-based model thus demonstrates high accuracy and stability in detecting AI-generated text.
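For reference, the following is a minimal sketch of fine-tuning a pre-trained BERT model for binary text classification with the Hugging Face transformers library, along the lines of the detection task described above; the checkpoint name, example texts, and learning rate are illustrative assumptions, not the cited work's settings.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained BERT-Base checkpoint with a 2-class classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Toy batch (labels: 0 = human-written, 1 = AI-generated; illustrative only)
texts = ["sample human-written text", "sample generated text"]
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One fine-tuning step: the model returns both loss and logits
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()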
Although BERT performs excellently in text classification tasks, its high computational cost, long training time, large memory footprint, slow inference, risk of overfitting, complex tuning, and poor interpretability must be weighed carefully in practical applications. This paper therefore adopts a Hybrid CNN-Transformer model to classify coal mine accident hazard text data: the CNN extracts local features while the Transformer captures global semantic information, yielding strong performance on complex text classification tasks.
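The following is a minimal, self-contained sketch of this hybrid idea in PyTorch, not the paper's actual architecture: all dimensions, layer counts, the mean-pooling step, and the class count are illustrative assumptions.

import torch
import torch.nn as nn

class HybridCNNTransformer(nn.Module):
    """Sketch of a hybrid CNN-Transformer classifier: a 1-D convolution
    extracts local n-gram features from token embeddings, a Transformer
    encoder models global dependencies over those features, and the
    mean-pooled outputs feed a classification layer."""
    def __init__(self, vocab_size=30000, embed_dim=128, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids)                    # (batch, seq_len, dim)
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # local features
        x = self.encoder(x)                          # global semantic context
        return self.classifier(x.mean(dim=1))        # pooled class logits

# Usage with dummy token ids (batch of 4 sequences, 32 tokens each)
logits = HybridCNNTransformer()(torch.randint(0, 30000, (4, 32)))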