Pretrained language models have driven significant progress in natural language processing (NLP). For Persian, however, the scarcity of diverse, suitable textual data, and consequently the absence of efficient pretrained language models, has meant that results on many Persian language tasks fall short of user expectations. To improve these results, this article introduces AriaBert, a pretrained language model that achieves state-of-the-art performance in Persian language understanding. To provide diverse training data, various types of Persian text were collected and grouped into three categories: conversational, formal, and hybrid, covering tweets, news articles, poems, medical texts, encyclopedia articles, user reviews from websites, and other text types; the collected training corpus exceeds 32 gigabytes in total. Beyond the training data, and unlike most monolingual and multilingual language models built on the BERT base architecture, AriaBert uses the RoBERTa architecture together with a Byte-Pair Encoding (BPE) tokenizer. To evaluate its performance, AriaBert is compared with the major monolingual and multilingual Persian language models on several NLP tasks, including text classification, sentiment analysis, and stance detection. The results show that AriaBert outperforms its competitors on all of these tasks, with average improvements over the best results of other Persian language models of 3% in sentiment analysis, 0.65% in classification, and 3% in stance detection.
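As a rough illustration of the setup described above (a sketch, not the authors' released code), the following Python snippet shows how a RoBERTa-style masked language model can be paired with a byte-level BPE tokenizer using the Hugging Face `tokenizers` and `transformers` libraries. The corpus file names, vocabulary size, and model dimensions are illustrative assumptions, not values reported in this article.

```python
# Minimal sketch of a RoBERTa + byte-level BPE pretraining setup.
# File names, vocab size, and model dimensions are hypothetical.
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# 1) Train a byte-level BPE tokenizer on raw Persian corpus files
#    (hypothetical splits mirroring the conversational/formal/hybrid grouping).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["conversational.txt", "formal.txt", "hybrid.txt"],  # assumed corpus files
    vocab_size=50_000,                                          # assumed vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("bpe-tokenizer", exist_ok=True)
tokenizer.save_model("bpe-tokenizer")  # writes vocab.json and merges.txt

# 2) Instantiate a RoBERTa-base-sized model for masked-language-model pretraining.
config = RobertaConfig(
    vocab_size=50_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)
print(f"parameters: {model.num_parameters():,}")
```

One reason such a setup is commonly chosen over a vanilla BERT configuration is that RoBERTa keeps BERT's Transformer encoder but drops the next-sentence-prediction objective and trains with dynamic masking, while the byte-level BPE tokenizer avoids out-of-vocabulary tokens in morphologically rich text.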