Beginning in earnest with OpenAI’s GPT-3, the focus in the field of natural language processing has turned to large language models (LLMs). LLMs — so called because of the enormous amounts of data, compute, and storage required to develop them — are capable of impressive feats of language understanding, like generating code and writing rhyming poetry. But as a growing number of studies point out, LLMs are impractically large for most researchers and organizations to take advantage of. Not only that, but they consume an amount of power that calls into question whether they’re sustainable to use over the long run.
New research suggests that this needn’t be the case forever, though. In a recent paper, Google introduced the Generalist Language Model (GLaM), which the company claims is one of the most efficient LLMs of its size and type. Despite containing 1.2 trillion parameters — nearly seven times as many as GPT-3 (175 billion) — Google says that GLaM improves on popular language benchmarks while using “significantly” less computation during inference.
“Our large-scale … language model, GLaM, achieves competitive results on zero-shot and one-shot learning and is a more efficient model than prior monolithic dense counterparts,” the Google researchers behind GLaM wrote in a blog post. “We hope that our work will spark more research into compute-efficient language models.”
Sparsity vs. density
In machine learning, parameters are the part of the model that’s learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. DeepMind’s recently detailed Gopher model has 280 billion parameters, while Microsoft’s and Nvidia’s Megatron 530B boasts 530 billion. Both are among the top — if not the top — performers on key natural language benchmark tasks including text generation.
But training a model like Megatron 530B requires hundreds of GPU- or accelerator-equipped servers and millions of dollars. It’s also bad for the environment. Training GPT-3 alone consumed 1,287 megawatt-hours of electricity and produced 552 metric tons of carbon dioxide emissions, a Google study found. That’s roughly equivalent to the yearly emissions of 58 homes in the U.S.
What makes GLaM different from most LLMs to date is its “mixture of experts” (MoE) architecture. An MoE can be thought of as having different layers of “submodels,” or experts, specialized for different text. The experts in each layer are controlled by a “gating” component that taps the experts based on the text. For a given word or part of a word, the gating component selects the two most appropriate experts to process the word or word part and make a prediction (e.g., generate text).
The full version of GLaM has 64 experts per MoE layer with 32 MoE layers in total, but only uses a subnetwork of 97 billion (8% of 1.2 trillion) parameters per word or word part during processing. “Dense” models like GPT-3 use all of their parameters for processing, significantly increasing the computational — and financial — requirements. For example, Nvidia says that processing with Megatron 530B can take over a minute on a CPU-based on-premises server. It takes half a second on two Nvidia-designed DGX systems, but just one of those systems can cost $7 million to $60 million.
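The top-2 routing described above can be sketched in a few lines of code. This is an illustrative toy, not GLaM’s actual implementation: the embedding size, the learned gating matrix, and the softmax mixing of the two selected experts are all assumptions for the sake of the example.

```python
import numpy as np

def top2_gate(token_vec, gate_weights):
    """Illustrative top-2 gating: score every expert, keep the two best.

    token_vec:    (d,) embedding for one token (word or word part)
    gate_weights: (d, n_experts) learned gating matrix (hypothetical)
    """
    logits = token_vec @ gate_weights        # one score per expert
    top2 = np.argsort(logits)[-2:]           # indices of the 2 highest-scoring experts
    # Softmax over just the two selected scores gives their mixing weights.
    w = np.exp(logits[top2] - logits[top2].max())
    w /= w.sum()
    return top2, w

rng = np.random.default_rng(0)
d, n_experts = 16, 64                        # 64 experts per layer, as in GLaM
experts, mix = top2_gate(rng.normal(size=d), rng.normal(size=(d, n_experts)))
# Only these 2 of the 64 experts run for this token; their outputs would be
# combined using the weights in `mix`, which sum to 1.
```

The point of the sketch is the sparsity: per token, only 2 of the 64 expert subnetworks in a layer do any work, which is why GLaM activates roughly 8% of its 1.2 trillion parameters per token while a dense model activates all of them.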
GLaM isn’t perfect — it exceeds or matches the performance of a comparable dense LLM on between 80% and 90% of tasks, but not all of them. And GLaM uses more computation during training, because it trains on a dataset with more words and word parts than most LLMs. (Versus the billions of words from which GPT-3 learned language, GLaM ingested a dataset that was initially over 1.6 trillion words in size.) But Google claims that GLaM consumed less than half the energy needed to train GPT-3: 456 megawatt-hours (MWh) versus 1,287 MWh. For context, a single megawatt is enough to power around 796 average U.S. homes.
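The energy comparison is easy to sanity-check with the figures reported in the article:

```python
# Training-energy figures as reported: GLaM vs. GPT-3.
glam_mwh = 456
gpt3_mwh = 1287
ratio = glam_mwh / gpt3_mwh
# "Less than half the energy needed to train GPT-3" holds: 456/1287 ≈ 0.35.
print(f"GLaM used {ratio:.0%} of GPT-3's training energy")  # prints: GLaM used 35% of GPT-3's training energy
```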
“GLaM is yet another step in the industrialization of large language models. The team applies and refines many modern tweaks and advancements to improve the performance and inference cost of this latest model, and comes away with an impressive feat of engineering,” Connor Leahy, a data scientist at EleutherAI, an open AI research collective, told VentureBeat.