GLM-130B: An Open Bilingual Pre-Trained Model


Description of a General Language Model (GLM) project based at Tsinghua University in Beijing, but with users and collaborators around the world.

Homepage (August 4, 2022)

This prospectus is difficult for outsiders to understand because of its large number of unexplained acronyms, abbreviations, initialisms, and other insider terminology.

GLM-130B is an open bilingual (English & Chinese) bidirectional dense model with 130 billion parameters, pre-trained using the General Language Model (GLM) algorithm [1]. It is designed to support inference tasks with the 130B parameters on a single A100 (40G * 8) or V100 (32G * 8) server. As of July 3rd, 2022, GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English) and exhibits the following unique features:

    • Bilingual: supports both English and Chinese.
    • Performance (EN): better than GPT-3 175B (+5.0%), OPT-175B (+6.5%), and BLOOM-176B (+13.0%) on LAMBADA and slightly better than GPT-3 175B (+0.9%) on MMLU.
    • Performance (CN): significantly better than ERNIE TITAN 3.0 260B on 7 zero-shot CLUE datasets (+24.26%) and 5 zero-shot FewCLUE datasets (+12.75%).
    • Fast Inference: supports fast inference on both SAT and FasterTransformer (up to 2.5X faster) with a single A100 server.
    • Reproducibility: all results (>30 tasks) can be easily reproduced with open-sourced code and model checkpoints.
    • Cross-Platform: supports training and inference on NVIDIA, Hygon DCU, Ascend 910, and Sunway.

The model checkpoints of GLM-130B and code for inference are publicly available at our GitHub repo. The code for pre-training and fine-tuning as well as the research paper are coming soon.
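
A quick way to see why 130 billion parameters can be served on a single 8-GPU node, as the prospectus claims, is simple memory arithmetic. The back-of-the-envelope Python sketch below assumes the weights are stored at 2 bytes per parameter (FP16) or 1 byte per parameter (INT8) and sharded evenly across 8 GPUs; those precision figures are my own illustrative assumptions, not numbers taken from the project.

    # Back-of-the-envelope memory check for serving a 130B-parameter model
    # on an 8-GPU server. Bytes-per-parameter values are illustrative assumptions.
    N_PARAMS = 130e9          # 130 billion parameters
    N_GPUS = 8
    GIB = 1024 ** 3

    def weights_gib(bytes_per_param):
        """Total memory for the model weights, in GiB."""
        return N_PARAMS * bytes_per_param / GIB

    for label, bpp in [("FP16", 2.0), ("INT8", 1.0)]:
        total = weights_gib(bpp)
        per_gpu = total / N_GPUS  # weights sharded evenly across the GPUs
        print(f"{label}: {total:.0f} GiB total, about {per_gpu:.0f} GiB per GPU")

At FP16 the per-GPU share (roughly 30 GiB) just fits within an A100's 40 GB, while at INT8 (roughly 15 GiB) it fits within a V100's 32 GB, leaving headroom for activations during inference.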

I treat the above opening portion of the prospectus as a sort of introduction to the project. The sections that follow are:

Figure 1. The performance of GLM-130B vs. models of similar scale on MMLU and LAMBADA.

Conceiving GLM-130B (describes the early history of the project and the challenges it faced; choice of pre-training algorithm; sponsorship and computational resources)

Figure 2. Major Issues Encountered for Training GLM-130B (month-by-month listing from December 2021 to July 2022)

The Performance of GLM-130B

Figure 3. Zero-shot performance on part of CLUE and FewCLUE benchmark datasets.

The GLM-130B Model (VHM: this is actually more of an introduction than the prefatory material above, so I give a brief, three-sentence excerpt from it here):

GLM-130B is a bilingual (English & Chinese) bidirectional language model with 130 billion parameters trained on over 400 billion text tokens. It is based on the General Language Model (GLM) [1] architecture. GLM-130B leverages autoregressive blank infilling as its primary pre-training objective.

Figure 4. Example: How GLM-130B is pre-trained on “Like a complete unknown, like a rolling stone”
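
For readers unfamiliar with the term in the excerpt, "autoregressive blank infilling" means that spans of the input text are blanked out with mask tokens, and the model then generates the missing spans from left to right, conditioned on the corrupted context. The toy Python sketch below builds one such training example from the Dylan line in Figure 4; the special-token names ([MASK], [SOP], [EOP]), the single-span corruption, and the fixed span length are my own simplifying assumptions, not the project's actual tokenizer or span-sampling scheme.

    import random

    def make_blank_infilling_example(tokens, span_len=2, mask="[MASK]",
                                     sop="[SOP]", eop="[EOP]", seed=0):
        """Blank out one contiguous span and build an autoregressive target for it.

        The model reads the corrupted context, then generates the masked span
        token by token (left to right) after the [SOP] marker.
        """
        rng = random.Random(seed)
        start = rng.randrange(len(tokens) - span_len + 1)
        span = tokens[start:start + span_len]
        corrupted = tokens[:start] + [mask] + tokens[start + span_len:]
        # Training input: corrupted context, then the span to be generated.
        model_input = corrupted + [sop] + span
        # Targets are the span shifted by one step, ending in [EOP].
        targets = span + [eop]
        return model_input, targets

    sentence = "like a complete unknown like a rolling stone".split()
    inp, tgt = make_blank_infilling_example(sentence)
    print("input :", inp)
    print("target:", tgt)

During training the span tokens after [SOP] are fed with teacher forcing, so the targets are simply the span shifted by one position; this sketch only illustrates the input/target layout, not the fuller span sampling the actual objective uses.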

Future Work

Acknowledgements (click on the arrowhead to expand the list; some essential acronyms (e.g., KEG = Knowledge Engineering Group) are explained here)

References (mostly in computational linguistics, machine learning, neural information processing systems, language modeling, scaling, etc.)

War of the machines

 

Selected readings

[Thanks to Bill Benzon]


