Abstract
Recent work on language modeling has shown that training large-scale Transformer models drives the latest advances in natural language processing applications. However, little work has been done to unify the currently effective models. In this work, we use currently effective model architectures and the most mainstream training techniques to release a set of models, which we believe will serve as basic models for future work.
For Chinese, a 10.3-billion-parameter language model based on GPT-2[9] was trained on a Chinese dataset and, in particular, a 2.9-billion-parameter language model was trained on dialogue data; a BERT model with 495 million parameters was trained on the Chinese dataset; and a Transformer language model with 5.6 billion parameters was trained on the Chinese dataset. Corresponding training was carried out for English as well: a 6.4-billion-parameter GPT-2 language model was trained on an English dataset; a BERT[3] model with 1.24 billion parameters was trained on the English dataset and, in particular, a 688-million-parameter model was trained using single-card training technology; and a Transformer language model with 5.6 billion parameters was trained on the English dataset.
On the TNEWS classification task of the CLUE[13] benchmark, our BERT-C model achieved an accuracy of 59.99%, exceeding the 59.46% of ALBERT-xxlarge by 0.53%. On the QQP classification task of the GLUE[11] benchmark, our accuracy of 78.95% surpassed the 72.1% of BERT-Large by 6.85% and exceeded the 75.2% of ERNIE, currently first on the GLUE leaderboard, by 3.75%.