Tech news
Monday, Jun 14, 2021

Why GPT-3 changes everything (and how it works)

A software program that ingests gigabytes of text can automatically generate whole paragraphs so natural they sound like a person wrote them. OpenAI’s GPT-3 is all the rage. What is it, what can it do, and where is it going?

GPT-3 is a computer program created by the privately held San Francisco startup OpenAI. It is a gigantic neural network, and as such, it is part of the deep learning segment of machine learning, which is itself a branch of the field of computer science known as artificial intelligence, or AI. The program is better than any prior program at producing lines of text that sound like they could have been written by a human.

The reason that such a breakthrough could be useful to companies is that it has great potential for automating tasks. GPT-3 can respond to any text that a person types into the computer with a new piece of text that is appropriate to the context. Type a full English sentence into a search box, for example, and you're likely to get back a relevant response in full sentences. That means GPT-3 can conceivably amplify human effort in a wide variety of situations, from questions and answers for customer service to due diligence document search to report generation.

Observe the following brief example of what a person types into the computer, and how GPT-3 sends back a reply:

Human-supplied input: Q: Who played tess on touched by an angel?
GPT-3-generated completion: A: Delloreese Patricia Early (July 6, 1931 – November 19, 2017), known professionally as Della Reese
The program is currently in a private beta for which people can sign up on a waitlist. It's being offered by OpenAI as an API accessible through the cloud, and companies that have been granted access have developed some intriguing applications that use the generation of text to enhance all kinds of programs, from simple question-answering to producing programming code.
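Calling such a cloud endpoint might look roughly like the following sketch, which only assembles the request rather than sending it. The URL, engine name, and parameter names here are assumptions based on public descriptions of the beta, not official documentation:

```python
import json

# Hypothetical endpoint path; the beta API is accessed over HTTPS.
API_URL = "https://api.openai.com/v1/engines/davinci/completions"

def build_completion_request(prompt, max_tokens=64, temperature=0.7):
    """Assemble the JSON body for a text-completion request."""
    return json.dumps({
        "prompt": prompt,            # the human-supplied input text
        "max_tokens": max_tokens,    # cap on the length of the generated reply
        "temperature": temperature,  # higher values produce more varied output
    })

body = build_completion_request("Q: Who played tess on touched by an angel?\nA:")
print(body)
```

An approved developer would POST that body, with an API key, to the endpoint and receive generated text back.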

Along with the potential for automation come great drawbacks. GPT-3 is compute-hungry, putting it beyond the use of most companies in any conceivable on-premise fashion. Its generated text can be impressive at first blush, but long compositions tend to become somewhat senseless. And it has great potential for amplifying biases, including racism and sexism.

GPT-3 is an example of what's known as a language model, which is a particular kind of statistical program. In this case, it was created as a neural network.

The name GPT-3 is an acronym that stands for "generative pre-training," of which this is the third version so far. It's generative because, unlike other neural networks that spit out a numeric score or a yes-or-no answer, GPT-3 can generate long sequences of original text as its output. It is pre-trained in the sense that it has not been built with any domain knowledge, even though it can complete domain-specific tasks, such as foreign-language translation.

A language model, in the case of GPT-3, is a program that calculates how likely one word is to appear in a text given the other words in the text. That is what is known as the conditional probability of words.

For example, in the sentence, I wanted to make an omelet, so I went to the fridge and took out some ____, the blank can be filled with any word, even gibberish, given the infinite composability of language. But the word "eggs" probably scores pretty high to fill that blank in most normal texts, higher than, say, "elephants." We say that the probability of eggs on the condition of the prompted text is higher than the probability of elephants.
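The idea can be made concrete with a toy bigram counter, which scores a candidate next word by how often it followed the previous word in a small corpus. This is purely illustrative; GPT-3 conditions on vastly more context:

```python
from collections import Counter, defaultdict

# A tiny corpus in which "eggs" follows "some" more often than "elephants" does.
corpus = (
    "i went to the fridge and took out some eggs . "
    "i went to the store and took out some eggs . "
    "i saw some elephants at the zoo ."
).split()

# Count, for each word, which words follow it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def prob(word, given):
    """P(word | given): how often `word` follows `given` in the corpus."""
    total = sum(follows[given].values())
    return follows[given][word] / total if total else 0.0

print(prob("eggs", "some"))       # 2 of the 3 occurrences of "some" precede "eggs"
print(prob("elephants", "some"))  # only 1 of 3 precedes "elephants"
```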

When the neural network is being developed, called the training phase, GPT-3 is fed millions and millions of samples of text and it converts words into what are called vectors, numeric representations. That is a form of data compression. The program then tries to unpack this compressed text back into a valid sentence. The task of compressing and decompressing develops the program's accuracy in calculating the conditional probability of words.

Once the model has been trained, meaning, its calculations of conditional probability across billions of words are made as accurate as possible, then it can predict what words come next when it is prompted by a person typing an initial word or words. That action of prediction is known in machine learning as inference.

That leads to a striking mirror effect. Not only do likely words emerge, but the texture and rhythm of a genre or the form of a written task, such as question-answer sets, is reproduced. So, for example, GPT-3 can be fed some names of famous poets and samples of their work, then the name of another poet and just a title of an imaginary poem, and GPT-3 will produce a new poem in a way that is consistent with the rhythm and syntax of the poet whose name has been prompted. 

Generating a response means GPT-3 can go way beyond simply producing writing. It can perform on all kinds of tests, including tests of reasoning that involve a natural-language response. If, for example, GPT-3 is given an essay about rental rates for Manhattan properties, a statement summarizing the text, such as "Manhattan comes cheap," and the question "true or false?", GPT-3 will respond to that entire prompt by returning the word "false," as the statement doesn't agree with the argument of the essay.

GPT-3's ability to respond in a way consistent with an example task, including forms it has never been shown before, makes it what is called a "few-shot" language model. Instead of being extensively tuned, or "fine-tuned," as it's called, on a given task, GPT-3 already has so much information about the many ways that words combine that it can be given only a handful of examples of a task, supplied right in the prompt, and it gains the ability to also perform that new task.

OpenAI has now become as famous -- or infamous -- for the release practices of its code as for the code itself. When the company unveiled GPT-2, the predecessor, on Valentine's Day of 2019, it initially would not release to the public the most-capable version, saying it was too dangerous to release into the wild because of the risk of mass-production of false and misleading text. OpenAI has subsequently made it available for download.

This time around, OpenAI is not providing any downloads. Instead, it has turned on a cloud-based API endpoint, making GPT-3 an as-a-service offering. (Think of it as LMaaS, language-model-as-a-service.) The reason, claims OpenAI, is both to limit GPT-3's use by bad actors and to make money.

"There is no 'undo button' with open source," OpenAI told ZDNet through a spokesperson.

"Releasing GPT-3 via an API allows us to safely control its usage and roll back access if needed."

At present, the OpenAI API service is limited to approved parties; there is a waitlist one can join to gain access.

"Right now, the API is in a controlled beta with a small number of developers who submit an idea for something they'd like to bring to production using the API," OpenAI told ZDNet.

There are intriguing examples of what can be done from companies in the beta program. Sapling, a company backed by venture fund Y Combinator, offers a program that sits on top of CRM software. When a customer rep is handling an inbound help request, say, via email, the program uses GPT-3 to suggest an entire phrase as a response from among the most likely responses.

Game maker Latitude is using GPT-3 to enhance its text-based adventure game, AI Dungeon. Usually, an adventure game would require a complex decision tree to script many possible paths through the game. Instead, GPT-3 can dynamically generate a changing state of gameplay in response to users' typed actions.

Already, task automation is going beyond natural language to generating computer code. Code is a language, and GPT-3 can infer the most likely syntax of operators and operands in different programming languages, and it can produce sequences that can be successfully compiled and run.

An early example lit up the Twitter-verse, from app development startup Debuild. The company's chief, Sharif Shameem, was able to construct a program where you type your description of a software UI in plain English, and GPT-3 responds with computer code using the JSX syntax extension to JavaScript. That code produces a UI matching what you've described.

Shameem showed that by describing a UI with multiple buttons, with a single sentence he could describe an entire program, albeit a simple one such as computing basic arithmetic and displaying the result, and GPT-3 would produce all the code for it and display the running app.

OpenAI has "gotten tens of thousands of applications for API access to date, and are being judicious about access as we learn just what these models can do in the real world," the company told ZDNet. "As such, the waitlist may be long."

Pricing for an eventual commercial service is still to be determined. Asked when the program will come out of beta, OpenAI told ZDNet, "not anytime soon."

"Releasing such a powerful model means that we need to go slow and be thoughtful about its impact on businesses, industries, and people," the company said. "The format of an API allows us to study and moderate its uses appropriately, but we're in no rush to make it generally available given its limitations."

If you're impatient with the beta waitlist, you can in the meantime download the prior version, GPT-2, which can be run on a laptop using a Docker installation. Source code is posted in OpenAI's Github repository, in Python format for the TensorFlow framework. You won't get the same results as GPT-3, of course, but it's a way to start familiarizing yourself.

Remember, too, new language models with similar capabilities appear all the time, and some of them may be sufficient for your purposes. For example, Google recently released a version of its BERT language model, called LaBSE, which demonstrates a marked improvement in language translation. It is available for download from the TensorFlow Hub.

GPT-3, unveiled in May, is the third version of a program first introduced in 2018 by OpenAI and followed last year by GPT-2. The three programs are an example of rapid innovation in the field of language models, thanks to two big advances, both of which happened in 2015.

The first advance was the use of what's known as attention. AI scientist Yoshua Bengio and colleagues at Montreal's Mila institute for AI observed that language models, when they compressed an English-language sentence and then decompressed it, all used a vector of a fixed length. Every sentence was crammed into a same-sized vector, no matter how long the sentence.

Bengio and his team concluded that this rigid approach was a bottleneck. A language model should be able to search across many vectors of different lengths to find the words that optimize the conditional probability. And so they devised a way to let the neural net flexibly compress words into vectors of different sizes, as well as to allow the program to flexibly search across those vectors for the context that would matter. They called this attention.
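A minimal sketch of the attention idea, in plain Python and making no assumptions about the actual architecture: keep one vector per word, score each against a query, and mix the vectors according to those scores:

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, context_vectors):
    """Weight each context vector by its dot-product similarity to the query."""
    scores = [sum(q * c for q, c in zip(query, vec)) for vec in context_vectors]
    weights = softmax(scores)
    # Weighted sum: the most relevant vectors dominate the result.
    mixed = [sum(w * vec[i] for w, vec in zip(weights, context_vectors))
             for i in range(len(query))]
    return mixed, weights

# Three word vectors of context; the number of vectors can vary freely.
context = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
mixed, weights = attend([1.0, 0.0], context)
print(weights)  # the first and third vectors, most similar to the query, get the most weight
```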

Attention became a pivotal element in language models. It was used by Google scientists two years later to create a language model program called the Transformer. The Transformer racked up incredible scores on tests of language manipulation. It became the de facto language model, and it was used by Google to create what's known as BERT, another very successful language model. The Transformer also became the basis of GPT-1.

Freed of the need to rigidly manipulate a fixed-size vector, the Transformer and its descendants could roam all over different parts of a given text and find conditional dependencies that would span much greater context.

That freedom set the stage for another innovation that arrived in 2015 and that was even more central to OpenAI's work, known as unsupervised learning.

The focus up until that time for most language models had been supervised learning with what is known as labeled data. Given an input, a neural net is also given an example output as the objective version of the answer. So, if the task is translation, an English-language sentence might be the input, and a human-created French translation would be supplied as the desired goal, and the pair of sentences constitute a labeled example.

The neural net's attempt at generating a French translation would be compared to the official French sentence, and the difference between the two is how much the neural net is in error in making its predictions, what's known as the loss function or objective function.

The training phase is meant to close this error gap between the neural net's suggested output and the target output. When the gap is as small as can be, the objective function has been optimized, and the language model's neural net is considered trained.
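That error-closing loop can be pictured with a single weight and plain gradient descent. This is a toy sketch, not OpenAI's actual training procedure:

```python
# The "model" is one weight multiplying a fixed input of 2.0;
# training nudges the weight until the output matches the target.
target = 4.0
weight = 0.0
lr = 0.1  # learning rate: how big each adjustment step is

for step in range(100):
    output = weight * 2.0       # the model's suggested output
    error = output - target     # gap between suggestion and target
    loss = error ** 2           # squared-error objective function
    gradient = 2 * error * 2.0  # derivative of the loss w.r.t. the weight
    weight -= lr * gradient     # adjust the weight to shrink the gap

print(round(weight, 3))  # converges to 2.0, where output equals the target
```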

But having the desired output carefully labeled can be a problem because it requires lots of curation of data, such as assembling example sentence pairs by human judgment, which is time-consuming and resource-intensive. Andrew Dai and Quoc Le of Google hypothesized it was possible to reduce the labeled data needed if the language model was first trained in an unsupervised way.

Instead of being given a sentence pair, the network was given only single sentences and had to compress each one to a vector and decompress each one back to the original sentence. Mirroring became the loss function to optimize. They found that the more unlabeled examples were compressed and decompressed in this way, the more they could replace lots of labeled data on tasks such as translation.
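A toy round trip makes the compress-and-decompress picture concrete. This version is lossless for simplicity, whereas a real model compresses lossily into learned vectors, and reconstructing the original well is what supplies the training signal:

```python
vocab = {}

def encode(sentence):
    """'Compress' a sentence into numeric ids, growing the vocabulary as needed."""
    return [vocab.setdefault(word, len(vocab)) for word in sentence.split()]

def decode(ids):
    """'Decompress' the ids back into words."""
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

print(decode(encode("the cat sat on the mat")))  # round-trips to the original
```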

In 2018, the OpenAI team combined these two elements, the attention mechanism that Bengio and colleagues developed, which would roam across many word vectors, and the unsupervised pre-training approach of Dai and Le that would gobble large amounts of text, compress it and decompress it to reproduce the original text.

They took a standard Transformer and fed it the contents of the BookCorpus, a database compiled by the University of Toronto and MIT consisting of over 7,000 published book texts totaling nearly a billion words, a total of 5GB. GPT-1 was trained to compress and decompress those books.

Thus began a three-year history of bigger and bigger datasets. The OpenAI researchers, hypothesizing that more data made the model more accurate, pushed the boundaries of what the program could ingest. With GPT-2, they tossed aside the BookCorpus in favor of a homegrown data set, consisting of eight million web pages scraped from outbound links from Reddit, totaling 40GB of data.

GPT-3's training data is more ginormous still: the popular CommonCrawl dataset of Web pages from 2016 to 2019. It is nominally 45TB worth of compressed text data, although OpenAI curated it to remove duplicates and otherwise improve quality. The final version is 570GB of data. OpenAI supplemented it with several additional datasets of various kinds, including books data.

With the arrival of GPT-1, 2, and 3, the scale of computing has become an essential ingredient for progress. The models use more and more computer power when they are being trained to achieve better results.

What optimizes a neural net during training is the adjustment of its weights. The weights, which are also referred to as parameters, are matrices, arrays of rows and columns by which each vector is multiplied. Through multiplication, the many vectors of words, or word fragments, are given greater or lesser weighting in the final output as the neural network is tuned to close the error gap.
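The multiplication itself is ordinary matrix-vector arithmetic, sketched here on a tiny two-weight-per-row example:

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

weights = [
    [2.0, 0.0],  # first output row amplifies the first input component
    [0.0, 0.5],  # second output row dampens the second input component
]
word_vector = [1.0, 1.0]
print(matvec(weights, word_vector))  # [2.0, 0.5]
```

GPT-3 does this at a scale of billions of weights, but the operation being tuned during training is the same.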

OpenAI found that to do well on their increasingly large datasets, they had to add more and more weights.

The original Transformer from Google had 110 million weights. GPT-1 followed this design. With GPT-2, the number was boosted to 1.5 billion weights. With GPT-3, the number of parameters has swelled to 175 billion, making GPT-3 the biggest neural network the world has ever seen.

Multiplication is a simple thing, but when 175 billion weights have to be multiplied by every bit of input data, across billions of bytes of data, it becomes an incredible exercise in parallel computer processing.

Already with GPT-1, in 2018, OpenAI was pushing at the boundaries of practical computing. Bulking up on data meant bulking up on GPUs. Prior language models had fit within a single GPU because the models themselves were small. GPT-1 took a month to train on eight GPUs operating in parallel.

With GPT-3, OpenAI has been a bit coy. It hasn't described the exact computer configuration used for training, other than to say it was on a cluster of Nvidia V100 chips running in Microsoft Azure. The company did describe the total compute required, stating that it is the equivalent of running a thousand trillion floating-point operations per second for 3,640 days, what's known as 3,640 petaflop/s-days of compute.

Computer maker and cloud operator Lambda Computing has estimated that it would take a single GPU 355 years to run that much compute, which, at a standard cloud GPU instance price, would cost $4.6 million. And then there's the memory. To hold all the weight values requires more and more memory as parameters grow in number. GPT-3's 175 billion parameters require 700GB, 10 times more than the memory on a single GPU.
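Those figures can be sanity-checked with back-of-the-envelope arithmetic. The sustained per-GPU throughput below is an assumption, roughly a V100 at mixed precision, and the memory figure assumes 4 bytes per parameter:

```python
# Total training compute: 3,640 petaflop/s-days expressed as raw operations.
compute = 3_640 * 1e15 * 86_400  # days * (ops/sec) * (sec/day)

gpu_flops = 28e12  # assumed sustained ops/sec for one GPU (approx. a V100)
gpu_years = compute / gpu_flops / 86_400 / 365
print(round(gpu_years))  # in the neighborhood of Lambda's 355-year estimate

# Memory: 175 billion parameters stored as 4-byte floating-point values.
parameters = 175_000_000_000
total_gb = parameters * 4 / 1e9
print(total_gb)  # 700.0 GB, far beyond the memory of any single GPU
```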

It's that kind of enormous power requirement that is propelling the field of computer chips. It has driven up the share price of Nvidia, the dominant GPU supplier for AI training, by almost 5,000% over the past ten years. It has given rise to a raft of startup companies backed by hundreds of millions of dollars in venture capital financing, including Cerebras Systems, Graphcore, and Tachyum. The competition will continue to flourish for as long as building bigger and bigger models remains the trajectory of the field.

OpenAI has produced its own research on the soaring computer power needed. The firm noted back in 2018 that computing cycles consumed by the largest AI training models have been doubling every 3.4 months since 2012, a faster rate of expansion than was the case for the famous Moore's Law of chip transistor growth. (Mind you, the company also has produced research showing that on a unit basis, the ever-larger models end up being more efficient than prior neural nets that did the same work.)

Already, models are under development that use more than a trillion parameters, according to companies briefed on top-secret AI projects. That's probably not the limit, as long as hyper-scale companies such as Google are willing to devote their vast data centers to ever-larger models. Most AI scholars agree that bigger and bigger will be the norm for machine learning models for some time to come.

