FPM: A Collection of Large-scale Foundation Pre-trained Language Models

Large-scale Transformer models have significantly advanced recent developments in natural language processing. However, little effort has been made to unify these effective models. In this paper, motivated by providing a new set of baseline models, we adopt various novel transformer architectures and release a model set built with recent mainstream technologies. We focus our discussion on optimizing network depth based on existing powerful encoder-decoder structures. We show that, by properly avoiding training defects such as non-convergence and degradation, scaling up off-the-shelf transformer architectures consistently delivers better performance. To stimulate future research on large-scale language model pre-training, we present extensive results and detailed discussions on performance improvements with respect to network depth, and confirm the existence of an optimal number of layers for specific tasks. To the best of our knowledge, we provide the largest Chinese generative model and the largest Chinese encoding model. The BERT language models we trained on English datasets deliver a 14.45% higher F1 score than Turing-NLR.


Introduction
Thanks to the increasing availability of hardware and the growing size of datasets, natural language processing has developed rapidly in recent years. Massive computation and data resources make it possible to train large-scale language models through self-supervised pre-training [10]. As model size increases, memory consumption can exceed the limits of modern hardware. Fortunately, recent developments in distributed training have enabled models with hundreds of millions or even tens of billions of parameters to be trained in parallel on large numbers of GPUs. Many techniques, such as layer normalization [1] and residual connections [5] in Transformer layers [17], help eliminate model degradation when scaling up models.
However, little effort has been made to unify these effective models. In this paper, motivated by providing a new set of baseline models, we adopt various novel transformer architectures and release a model set built with recent mainstream technologies.
In summary, the contributions of this article are as follows:
- We train a language model with 10.3 billion parameters, which, to the best of our knowledge, is the largest Chinese generative model.
- We train a BERT language model with 495 million parameters, which, to the best of our knowledge, is the largest Chinese encoding model.
- We train a GPT-2 language model with 6.4 billion parameters, which, to the best of our knowledge, is the largest English generative model.
- We train a list of BERT language models that, to the best of our knowledge, exceed Turing-NLR's F1 score by 14.45% on English datasets.
- We observe that the best result on Quora Question Pairs (QQP) from GLUE does not come from the largest 90-layer BERT-E-E model; the 70-layer BERT-E-L model achieves the state-of-the-art result with a 0.5% accuracy improvement over the 90-layer BERT-E-E model.
Autoregressive language models predict the next word x_i given all the previous words x_1, x_2, ..., x_{i-1}. The training goal is to maximize the log-likelihood L(θ) = Σ_i log P(x_i | x_1, ..., x_{i-1}; θ), where θ denotes the model parameters. Typical autoregressive language models include GPT, GPT-2 [12], and GPT-3 [2].
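This objective can be sketched as follows; the function name and the toy probabilities are illustrative only, not taken from any of the models above.

```python
import math

def autoregressive_log_likelihood(token_probs):
    # token_probs[i] holds the model's probability P(x_i | x_1, ..., x_{i-1}; theta)
    # for the i-th token of a sequence; training maximizes the sum of their logs.
    return sum(math.log(p) for p in token_probs)

# A toy 3-token sequence with illustrative next-token probabilities:
ll = autoregressive_log_likelihood([0.5, 0.25, 0.125])  # log(1/64)
```

In practice the probabilities come from a softmax over the vocabulary at each position, and the loss minimized is the negative of this sum (cross-entropy).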
Masked language models (MLMs) use the special token [MASK] to randomly select words to be masked, or replace them with random tokens. This architecture forces the model to gather bidirectional information when making predictions. Popular MLMs include BERT [4] and RoBERTa [8]. Specifically, MLMs such as BERT use the Transformer encoder architecture. Like autoregressive models, an MLM stacks multiple Transformer encoder layers to learn increasingly complex and meaningful representations, but when learning the representation of a specific token, it uses self-attention to attend to all the other tokens in the sequence.
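As an illustration of this corruption scheme, the sketch below implements the standard BERT-style 80/10/10 rule; the vocabulary and token lists are made up for the example, and real implementations differ in detail (e.g., whole-word masking).

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "sun", "sky"]  # illustrative vocabulary

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    # Select ~15% of positions as prediction targets; of those, 80% are
    # replaced with [MASK], 10% with a random token, and 10% kept as-is,
    # forcing the model to use bidirectional context to recover them.
    rng = rng or random.Random(0)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model must predict this original token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: leave the original token in place
    return corrupted, targets
```

The loss is then computed only at the positions recorded in `targets`, unlike the autoregressive objective, which is computed at every position.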

Architecture
We extend the GPT, BERT, and Transformer network structures to different numbers of layers, and study the performance differences among the models by changing only the number of layers.

GPT
In the experiments of the original GPT article, Radford et al. use a multi-layer Transformer decoder for the language model, a variant of the Transformer. It applies a multi-head self-attention operation to the input tokens, followed by a position-wise feed-forward layer, to produce the output distribution over target tokens. We study the performance impact of model depth by expanding the CPM model from its original 32 layers to 36, 64, and 128 layers, and the EPM model from 32 layers to 36, 50, 64, and 80 layers. All other hyper-parameters are identical to the original models.

BERT
BERT is a pre-trained Transformer network that set state-of-the-art results for various natural language processing tasks, including question answering, sentence classification, and sentence-pair regression.
The BERT-Large model described in the paper by Devlin et al. consists of 24 self-attention layers. In this study, we extend the original network from 24 layers to 50, 60, 70, 80, and 90 layers, with all other hyper-parameters unchanged. The performance impact of model depth is discussed in Table-3.

Transformer
CPM-2 is a standard Transformer model that combines a bidirectional encoder with a unidirectional decoder. To reduce memory consumption and accelerate pre-training, Zhang et al. use mixed-precision training, gradient checkpointing, and ZeRO-stage optimization [14].
Most powerful neural sequence transduction models have an encoder-decoder structure. The Transformer [17] follows this overall architecture, using stacked self-attention and position-wise, fully connected layers in both the encoder and decoder.
We reduce the original CPM-2 model from 48 Transformer layers to 12 and 24 layers. All other hyper-parameters are consistent with the CPM-2 model in the original paper. Moreover, we train the 12-layer EPM-2 model on English datasets.

Setup
In this work, we focus on three models: GPT-2, a language model based on a left-to-right generative transformer; BERT, a bidirectional transformer model based on masking; and CPM-2, an encoder-decoder language model.
Training Datasets A training dataset should first be as large as possible while maintaining high quality; its category distribution should be as even as possible; and its content should be as clean as possible. This work focuses only on model training and uses existing large-scale datasets: the diversified English Pile dataset and the Chinese Wudao dataset.
GPT-English tokenization uses the GPT-2 English vocabulary, containing 30,000 symbols. The OpenAI team used 40GB of text and 8 million documents collected in the WebText corpus. WebText consists of web pages curated and filtered by humans at OpenAI: all outbound links with a rating of at least 3 karma were crawled from Reddit. The resulting WebText dataset contains the text of a subset of 45 million links. In the cleanup phase, links created after December 2017 were removed; after deduplication and some heuristic-based cleanup, slightly more than 8 million documents remained, totaling 40GB of text, and all Wikipedia documents were removed.
GPT-Chinese tokenization uses the CPM Chinese vocabulary, containing 30,000 symbols. Zhang et al. use 100GB of multi-category text, including encyclopedias, news, novels, and question-answer pairs. The CPM Chinese vocabulary uses a unigram language model to build a new sub-word vocabulary from a sub-word corpus; the vocabulary includes commonly used words as well as characters. Since the input sequence length is usually greater than the length of a single document, different documents are concatenated by appending an "end-of-document" symbol after each document, making full use of the input length. Considering that the original BERT tokenization introduces an additional splitter between words, Zhang et al. set up a special token as the splitter to make the sub-word process reversible.
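The document-concatenation step can be sketched as below; the `<eod>` symbol and the function name are illustrative stand-ins, not the exact implementation.

```python
EOD = "<eod>"  # stands in for the "end-of-document" symbol

def pack_documents(docs, seq_len):
    # Append <eod> after each tokenized document, concatenate everything
    # into one stream, and cut the stream into fixed-length training
    # sequences so that the full input length is always used.
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOD)
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]

batches = pack_documents([["a", "b"], ["c"], ["d", "e", "f"]], seq_len=4)
# -> [["a", "b", "<eod>", "c"], ["<eod>", "d", "e", "f"]]
```

This way no padding is needed within a batch, and the model learns to treat `<eod>` as a boundary between unrelated contexts.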
Transformer-Chinese tokenization uses the CPM-2 Chinese vocabulary. It contains 26,240 symbols and is trained by Zhang et al. on 2.3TB of cleaned Chinese data. CPM-2's vocabulary is a modified Chinese byte-pair encoding (BPE). The original BPE inserts many redundant space marks "_" into the segmented sequence. Zhang et al. replace the sentence tokenizer with a combination of a word tokenizer and the jieba word segmenter, and delete the inserted spaces. Since it does not matter whether a symbol in the vocabulary appears at the beginning of a word, tokens such as "happy" and "_happy" (with a leading space mark) are merged into the single token "happy" to simplify the vocabulary.
Transformer-English tokenization uses the CPM-2 English vocabulary, containing 29,752 symbols, trained by Zhang et al. on 300GB of cleaned English data. CPM-2's vocabulary is a modified English BPE. The original BPE inserts many redundant space tokens "_" into the segmented sequence. Zhang et al. replaced the sentence tokenizer with a word tokenizer combined with NLTK word segmentation, and deleted the inserted spaces. The English data comes from multiple domains, including encyclopedias, novels, Q&A, scientific literature, e-books, news, and reviews.
We use an 800GB corpus from the Pile. Due to the limitation of computing resources, not all partitions were selected; we use only four partitions (02, 03, 04, and 17), about 213GB of data. Two parts of the corpus, StackExchange and GitHub, which contain many impurities, are included, and the entire dataset is about 200GB.
We use the 3TB Chinese corpus collected by Yuan et al. Due to the limitation of computing resources, we use 200GB of the Wudao corpus. The Wudao dataset takes 3 billion web pages as its raw data source and extracts text content from pages with high text density. The quality of each data source is evaluated before text extraction, and web pages with a text density below 70% are ignored.
The Chinese dialogue data comes from the STC-680M corpus, which contains approximately 4.4 million conversations from Weibo. To build this million-scale dataset, the authors first crawled hundreds of millions of response pairs, then filtered out low-quality responses by deleting trivial answers such as "wow" and advertisements, and kept only the first 30 responses per post to keep the topic consistent while cleaning the raw data.

Training Optimization and Hyperparameters
To train the models efficiently, mixed-precision training and dynamic loss scaling are adopted in most experiments to exploit the Tensor Cores of the A100. However, in some experiments training fails to converge due to numerical precision, and we remove mixed-precision training in those cases. Weights are first initialized with a simple normal distribution. The weights of layers immediately before residual connections are then scaled by 1/sqrt(2N), where N is the number of transformer layers, each composed of a self-attention block and a Multilayer Perceptron (MLP) block. For the optimizer, we use Adam with a weight decay of 0.01. In addition, global gradient-norm clipping at 1.0 is used to improve the stability of training large models. In all cases, we use a dropout of 0.1. Finally, to better manage the memory footprint, activation checkpointing is used after each transformer layer.
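Our reading of this initialization scheme (normal initialization, with the weights feeding each residual connection rescaled by 1/sqrt(2N)) can be sketched as follows; the function name, shapes, and the 0.02 default are illustrative.

```python
import math
import random

def init_weight(fan_out, fan_in, n_layers, pre_residual, std=0.02, rng=None):
    # Draw every weight from N(0, std^2); for matrices that feed directly
    # into a residual connection, shrink the standard deviation by
    # 1/sqrt(2N), where N is the number of transformer layers (each layer
    # contributes a self-attention block and an MLP block to the stream).
    rng = rng or random.Random(0)
    scale = std / math.sqrt(2 * n_layers) if pre_residual else std
    return [[rng.gauss(0.0, scale) for _ in range(fan_in)]
            for _ in range(fan_out)]
```

The rescaling keeps the variance accumulated along the residual stream roughly constant as the network deepens, which is one way to avoid the degradation mentioned above.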
For the GPT-2 models, all training experiments use sequences of 1024 tokens with a batch size of 512, running for 300k iterations. We warm up the learning rate to 1.5e-4 over 3k iterations, then apply a single-cycle cosine decay over the remaining iterations, stopping the decay at a minimum learning rate of 1e-5.
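The schedule just described (linear warmup to 1.5e-4 over 3k iterations, then a single cosine cycle down to the 1e-5 floor by iteration 300k) can be sketched as:

```python
import math

def gpt2_lr(step, peak=1.5e-4, floor=1e-5, warmup=3_000, total=300_000):
    # Linear warmup to the peak learning rate, then one cosine half-cycle
    # down to the minimum learning rate over the remaining iterations.
    if step < warmup:
        return peak * step / warmup
    progress = min((step - warmup) / (total - warmup), 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

The function name is ours; any framework scheduler implementing linear warmup plus cosine annealing with a minimum learning rate realizes the same curve.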
For BERT, we follow the training process described in the original paper, using the original BERT dictionary with a vocabulary size of 30,522. In addition, following the suggested sentence-order prediction, we replace the next-sentence-prediction head and use whole-word n-gram masking. In all cases, the batch size is 1024, and we use a learning rate of 1.0e-4 that warms up over 10,000 iterations and decays linearly over the remaining iterations. Other training parameters remain unchanged.

Model Details
The infrastructure is optimized for multi-node deep learning applications. All experiments use up to 2 DGX servers (a total of 16 A100 SXM3 40GB GPUs). Through NVSwitch, we achieve a bandwidth of 300GB/s between GPUs within a server, and 10GB/s between servers using one InfiniBand adapter per server.
We describe the most significant models as follows: -Based on GPT-2, we trained a 10.3-billion-parameter model on Chinese datasets and a 2.9-billion-parameter model on a dialogue corpus.
We trained a BERT model with 495 million parameters and a Transformer model with 5.6 billion parameters on Chinese datasets. -We apply the corresponding training to English. Using the GPT-2 model, we trained a model with 6.4 billion parameters on English datasets. We trained a BERT model with 1.24 billion parameters on English datasets, and a language model with 688 million parameters on one GPU. We trained a Transformer model with 2.9 billion parameters on English datasets.
EPM-X Based on the GPT-2 model, we built an encoder-decoder language model EPM-X, using the GPT-English tokenizer, and trained a language model with 2.9 billion parameters. It has a 12-layer network structure: 6 encoding layers and 6 decoding layers.
EPM-2-X Based on the Transformer model, we built an encoder-decoder language model EPM-2-X, using the Transformer-English tokenizer, and trained a language model with 2.9 billion parameters. It has a 12-layer network structure: 6 encoding layers and 6 decoding layers.
BERT-E Based on the BERT model, we built the encoding language model BERT-E. Using the BERT-English tokenizer, we trained five models with different numbers of layers. The largest is a language model with 1.24 billion parameters, stacking 90 Transformer encoding layers; it is named BERT-E-E, where the second E stands for Extreme.
BERT-X-EN Based on the BERT model, we built the encoding language model BERT-X-EN. Using the BERT-English tokenizer, we trained two models with different numbers of layers. The largest has 690 million parameters and stacks 48 Transformer encoding layers; it is named BERT-X-EN-M.
CPM-X Based on the GPT-2 model, we built the generative language model CPM-X, using the GPT-Chinese tokenizer, and trained 3 models of different sizes. The largest is a 10.3-billion-parameter language model stacking 128 Transformer decoding layers, named CPM-X-L. Also based on the GPT-2 model and the GPT-Chinese tokenizer, we built the generative language model CPM-X-EVA on the STC dialogue data: a 2.9-billion-parameter language model stacking 36 Transformer decoding layers.
CPM-2-X Based on the Transformer model, we built the Transformer language model CPM-2-X. We trained 2 models; the largest uses the Transformer-Chinese tokenizer and has 5.6 billion parameters. It has a 24-layer network structure, with 12 encoding layers and 12 decoding layers, and is named CPM-2-X-M.
BERT-C Based on the BERT model, we built the encoding language model BERT-C. Using a BERT-Chinese tokenizer, we trained a language model with 330 million parameters. It stacks 24 Transformer encoding layers and is named BERT-C.
BERT-X-CN Based on the BERT model, we built the encoding language model BERT-X-CN. Using a BERT-Chinese tokenizer, we trained a language model with 495 million parameters, stacking 36 Transformer encoding layers, named BERT-X-CN-S.

Evaluation
In the evaluation stage, we evaluate the five BERT-E models on the QQP classification task from GLUE and the BERT-C model on the TNEWS classification task from CLUE.

Discussion
We will discuss language, scale, network configuration, and cost for our training models. Furthermore, we will discuss the optimal layer number for the GLUE-QQP task.

Language, Scale and Architecture
As shown in Table-2, this work trains models on many Chinese and English corpora and reports detailed training configurations.
The models with the fewest layers are CPM-2-X and EPM-2-X, with only 12 layers; the model with the most layers is CPM-X, with 128 layers. From the perspective of training difficulty, the GPT structure scales more easily to many layers, while the Transformer structure is harder to deepen and reaches the hardware memory limit sooner. From the perspective of model parameters, the model with the fewest parameters is BERT-C, with only 330 million, and the one with the most is CPM-X-L, with 10.3 billion. For BERT, GPT, and Transformer, the network structure parameters used in this work are fixed; we use the parameters from the original papers without significant modifications.

Cost
From Table-4, we observe that in terms of time consumption, the shortest training time belongs to CPM-X-S and the longest to CPM-X-EN. The smallest number of training steps is CPM-X-S's, at only 100,000 steps, and the largest is CPM-X-EN's, at 2.8 million steps. From the perspective of computing power, the smallest consumer of computing resources is CPM-X-S, which uses 27 EFLOPs, and the largest is CPM-2-X-M, which uses 1,240 EFLOPs.

Optimal Layers
From Table-3, we observe that the best result does not come from the largest 90-layer BERT-E-E model; the 70-layer BERT-E-L model achieves the state-of-the-art result with about a 0.5% gain over the 90-layer BERT-E-E model. We conclude that, for the GLUE-QQP task, the optimal layer number for BERT models is 70. This conclusion is drawn under current GPU hardware limits; in the future, the optimal layer number for the GLUE-QQP task may change with more advanced GPU servers.

Conclusion
We focus our discussion on optimizing network depth based on existing powerful encoder-decoder structures. We observe that the best QQP result from GLUE does not come from the largest 90-layer BERT-E-E model; the 70-layer BERT-E-L model achieves the state-of-the-art result with about a 0.5% gain over the 90-layer BERT-E-E model.

Future Work
With the development of hardware and software, we will try more efficient modeling techniques for deeper networks and train improved Chinese and English models based on BERT and GPT in the future. In addition, we will explore, in terms of accuracy, the optimal model structure for different numbers of layers.