Abstract
Recent work on language modeling has shown that training large-scale Transformer models drives the latest advances in natural language processing applications. However, little work has been done to unify the currently effective models. In this work, we use currently effective model architectures and the most mainstream training techniques to release a set of models, which we believe will serve as basic models for future work.
For Chinese, a 10.3-billion-parameter language model based on GPT-2[9] was trained on a Chinese dataset and, in particular, a 2.9-billion-parameter language model was trained on dialogue data; a BERT model with 495 million parameters was trained on the Chinese dataset; and a Transformer language model with 5.6 billion parameters was trained on the Chinese dataset. Corresponding training was carried out for English as well: a 6.4-billion-parameter GPT-2 language model was trained on an English dataset; a BERT[3] model with 1.24 billion parameters was trained on the English dataset and, in particular, a 688-million-parameter model was trained using single-card training technology; and a Transformer language model with 5.6 billion parameters was trained on the English dataset.
On the TNEWS classification task of the CLUE[13] benchmark, our BERT-C model achieved an accuracy of 59.99%, exceeding the 59.46% of ALBERT-xxlarge by 0.53%. On the QQP classification task of the GLUE[11] benchmark, our accuracy of 78.95% surpassed the 72.1% of BERT-Large by 6.85% and exceeded the 75.2% of ERNIE, currently first on the GLUE leaderboard, by 3.75%.