
Google Trains a Trillion-Parameter Language Model


Google has announced in a recent research paper that it has trained a trillion-parameter language model. The AI model is said to be the largest of its kind to date.

Parameters are key to machine learning algorithms. They are the part of a model that is learned from historical training data. In general, the more parameters a model has, the more sophisticated it can be.
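As a rough illustration (not taken from Google's paper), even a two-parameter model works the same way: the parameters are numbers fitted to historical data, and language models simply have billions or trillions of them. The data and values below are made up for the example.

```python
# Minimal sketch: the "parameters" of a tiny linear model (a slope and an intercept)
# are learned from historical data. Large language models do the same thing,
# only with billions or trillions of such numbers.
import numpy as np

# Hypothetical historical training data: y is roughly 3*x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=100)

# Fit the two parameters with least squares.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"learned parameters: slope={slope:.2f}, intercept={intercept:.2f}")
```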

For instance, GPT-3, the model developed by OpenAI, has 175 billion parameters. At that scale, the natural language model can complete basic code and generate recipes, among other things.

On the other hand, Google’s multi-layered bidirectional Transformer encoder, known as BERT, has 110 million parameters. It helps Google’s search engine understand natural language text through entity recognition, part-of-speech tagging, and question answering, among other tasks.

Unfortunately, training models at this scale is no easy feat. Aside from the excessive cost, training language model parameters requires huge amounts of resources and extreme computational power.

However, Google researchers were able to develop and benchmark techniques that allowed them to train a language model with trillions of parameters.

The Switch Transformers

In their paper, Google researchers William Fedus, Barret Zoph, and Noam Shazeer wrote that “large scale training has been an effective path towards flexible and powerful neural language models.”

The trio noted that with large datasets and parameter counts, even simple architectures can surpass far more complicated algorithms. However, training on such large datasets requires enormous computational capability.

To address this issue, the Google researchers developed a technique that uses only a subset of a language model’s weights for any given input. They called it the Switch Transformer.

The Switch Transformer is built on the principle of maximizing the parameter count of a Transformer model in a simple and computationally efficient way.

It works by keeping multiple smaller “expert” models, each specialized in different tasks, within a larger model. A “gating network” then chooses which experts to consult for any given piece of data.
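Below is a minimal, hypothetical sketch of that top-1 (“switch”) routing idea in PyTorch. The layer sizes and expert count are made up, and in the actual model this routing is applied to expert feed-forward layers inside each Transformer block.

```python
# Minimal sketch of top-1 ("switch") routing, with made-up sizes.
# A router (the gating network) scores every expert for each token, and only the
# single best-scoring expert's weights are used for that token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchLayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)    # the "gating network"
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, tokens):                           # tokens: (num_tokens, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)   # score every expert per token
        gate, expert_idx = probs.max(dim=-1)             # keep only the top-1 expert
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            routed = expert_idx == i                     # tokens sent to expert i
            if routed.any():
                out[routed] = gate[routed].unsqueeze(-1) * expert(tokens[routed])
        return out

layer = SwitchLayer()
print(layer(torch.randn(8, 64)).shape)                   # torch.Size([8, 64])
```

Because only one expert runs for each piece of data, adding experts grows the parameter count while the computation per token stays roughly constant.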

The Switch Transformer leverages hardware designed for dense matrix multiplications, such as GPUs and TPUs. Its distributed training setup splits unique weights across different devices, making it possible to increase the number of devices while keeping the memory and computational footprint on each device manageable.
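The sketch below illustrates that idea under simplified assumptions: each expert’s weights are placed on a different device (falling back to CPU here), so adding experts adds parameters without growing any single device’s memory footprint. This is only a conceptual sketch, not the distributed setup used in the paper.

```python
# Conceptual sketch of expert parallelism: each expert's weights live on their own
# device, and a token is moved to whichever device hosts the expert it was routed to.
import torch
import torch.nn as nn

num_experts = 4
if torch.cuda.is_available():
    devices = [f"cuda:{i % torch.cuda.device_count()}" for i in range(num_experts)]
else:
    devices = ["cpu"] * num_experts  # fallback so the sketch runs anywhere

# Each expert is an ordinary feed-forward block placed on its own device.
experts = [
    nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64)).to(dev)
    for dev in devices
]

# A token routed to expert 2 (a hypothetical routing decision) is moved to that
# expert's device, processed there, and brought back.
token = torch.randn(1, 64)
output = experts[2](token.to(devices[2])).to("cpu")
print(output.shape)  # torch.Size([1, 64])
```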

Training a Trillion-Parameter Language Model

For their experiment, the team designed two large Switch Transformer models. One contains 395 billion parameters and the other 1.6 trillion parameters.

To pretrain their Switch Transformers, the Google researchers used 32 TPU cores on the Colossal Clean Crawled Corpus, a 750GB dataset of text scraped from Wikipedia, Reddit, and other web sources. They tasked each model with predicting the missing words in passages where 15% of the words were masked.
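A minimal sketch of that masked-word objective, using a toy whitespace tokenizer and a hypothetical [MASK] token (the example sentence is made up):

```python
# Minimal sketch of the masked-language-modeling objective described above:
# hide roughly 15% of the words in a passage and ask the model to predict them.
import random

random.seed(0)
passage = "the switch transformer routes each token to a single expert network"
words = passage.split()

num_to_mask = max(1, round(0.15 * len(words)))
masked_positions = random.sample(range(len(words)), num_to_mask)

targets = {pos: words[pos] for pos in masked_positions}          # what the model must predict
inputs = ["[MASK]" if i in masked_positions else w for i, w in enumerate(words)]

print("model input  :", " ".join(inputs))
print("model targets:", targets)
```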

Results showed that the team’s 1.6-trillion-parameter language model exhibited no training instability, in contrast to the 395-billion-parameter model. The researchers noted:

“Currently, the Switch Transformer translates substantial upstream gains better to knowledge-based tasks, than reasoning-tasks. Extracting stronger fine-tuning performance from large expert models is an active research question, and the pre-training perplexity indicates future improvements should be possible.”

Read More: Microsoft Gets Exclusive License To OpenAI’s GPT-3 Language Model




