Why the new AI/ML language model GPT-3 is a big deal

Why GPT-3 matters at a high level

August 2020

GPT-3 feels like the first time I used email, the first time I went from a command line text interface to a graphical user interface (GUI), or the first time I had high-speed internet with a web browser.

GPT-3 is an AI/NLP language model that allows users to generate text such as news articles, recommendations, or even responses from a smart virtual assistant.  Its importance falls somewhere around the invention of email or bitcoin, or the availability of fast internet with a web browser.  Still, I’d rank it as less important than the invention of the modern smartphone (Blackberry to iPhone) or laptop (IBM Portable/Thinkpad to Apple Macbook).  At least one inventor of GPT-3 believes it’s a solid first step to artificial general intelligence (AGI), albeit not fully at a human level yet.  I view GPT-3 as a very intelligent child – a Protean savant that knows everything on Wikipedia and the web, and that can fake being any adult you want it to be, but with clear gaps in certain places.

While GPT-3 has many flaws,  Delian Asparouhov gets the big picture right:  “My favorite analogy for explaining GPT-3 is that the iPhone put the world’s knowledge into your pocket, but GPT-3 provides 10,000 PhDs that are willing to converse with you on those topics.  30 years ago, Steve Jobs described computers as ‘bicycles for the mind.’ I’d argue that, even in its current form, GPT-3 is ‘a racecar for the mind.’”  

Briefly, in this post I’ll go over the context behind GPT-3, some technical details on how it works, speculations on its many use cases, and a discussion of its limits and problems.  I’ll end with my own explorations of these use cases, thanks to the early access that OpenAI and Greg Brockman have given me to the OpenAI API.

Context:  Other language models from Eliza to BERT, Meena, Blender, and GPT-3

While language models have evolved a lot since Eliza, GPT-3 is the first big step forward that has been opened up to the world.

The most useful branches of AI in the last 30 years have been in machine learning (ML), of which computer vision (teaching computers how to see) and language models and “natural language processing” (teaching computers how to read, to write, and to simulate verbal reasoning) are the most important.  Within NLP, most of the advances came first in feed-forward neural networks and then in recurrent neural networks (RNNs), where information flows and gets processed in loops as the algorithm tries to do tasks like parsing sentences for grammar, detecting the sentiment behind language, or understanding a question well enough to answer it.

Progress in practical ML is mostly demonstrated by creating benchmarks where researchers set the “human” level for doing a task, such as answering questions, disambiguating words, or parsing sentences, and then building ML models to beat it.  For example, the SQuAD 1.0 challenge is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage (SQuAD 2.0 later added questions that might be unanswerable).  The researchers gave humans the reading passages, asked them the questions, and found that the “human level” of comprehension – defined as giving an exact match of the right answer – was 87% (the F1 score, which balances precision and recall, was 89%).  It took about 18 months for language-model ensembles to beat that human level, and the current leader has 91% exact-match responses with a 93% F1 score (better than humans).  Here is an example from SQuAD 1.0:  Q: What is the Rankine cycle sometimes called? Sentence: The Rankine cycle is sometimes referred to as a practical Carnot cycle.  Note that the NLP community has also started creating benchmarks that bundle multiple language tasks, to test whether a model has a more generalized linguistic intelligence; two of the better ones are known as GLUE and SuperGLUE.

Until 2017, the best RNNs were a type called LSTMs (long short-term memory networks).  They were decent at language tasks and had some way of storing memory, but they weren’t great.  Then, in 2017, some Google scientists published a paper on a new model structure called “transformers”, which uses an encoder-decoder structure with an attention mechanism that maps token embeddings into queries, keys, and values.  This structure can learn more complex representations of language and verbal reasoning through attention and “self-attention.”  I won’t go into transformers in depth here, but they were the key theoretical step that powered the new generation of language models.  Here is the paper, a visual explanation, and a lecture on transformers.
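To make the “queries, keys, and values” idea a bit more concrete, here is a minimal sketch of the scaled dot-product attention at the heart of a transformer layer, written in plain NumPy.  It is an illustration of the mechanism from the paper, not any production implementation; the toy dimensions and random weights are just for demonstration.

import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices of query, key, and value vectors."""
    d_k = Q.shape[-1]
    # Each token's query is compared against every token's key...
    scores = Q @ K.T / np.sqrt(d_k)
    # ...the scores become attention weights...
    weights = softmax(scores, axis=-1)
    # ...and each output is a weighted mix of the value vectors.
    return weights @ V

# Toy example: 4 tokens with 8-dimensional embeddings projected to Q, K, V
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # token embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8): one contextualized vector per token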

Transformers powered the new generation of language models, and all the “Great Powers” of the AI world (Google, Facebook, Amazon, OpenAI, Stanford, UC Berkeley, etc.) started building their own transformer-based models.  The most notable are BERT, ELMo, Blender, and Meena.  Given the playful naming habits of these scientists, it sounds like a Bizarro Sesame Street episode (my own team at Amazon Music prefers to name our ML models after mythical monsters or sea creatures).  Some of these models are open-sourced (like BERT and Blender), but many are proprietary and quite valuable, as they can take millions of dollars of compute to train and teams of expensive scientists to get them ready for production-grade use.

What makes GPT-3 special is how much better it is than most of these earlier transformer models, and that OpenAI has put the model behind an API that is really easy to access and use.  Anyone can now create new production-ready uses of GPT-3.

Details on how GPT-3 works

My personal technical explanation, at a high level – feel free to skip this section if you want.

GPT-3 is not a fundamentally different type of language model than the ones mentioned above.  It’s a model that runs on transformers.  It has just been given more data, compute, and time to train, so it ends up with more parameters and is “smarter.”  Its 175bn parameters are more than the estimated 120bn neurons in the human brain.  Many of the leading researchers thought there would be diminishing returns to throwing more compute at existing models, and that we needed theoretical advances in model structure.  Ilya Sutskever and his team at OpenAI instead slowly built up from a simple version (GPT-1) and added more compute (cloud TPU chips from Google or Azure virtual GPUs) and some thoughtful changes until we got the brilliant results of GPT-3.  The holy grail of these models is what’s called zero-shot or one-shot learning.  Most “supervised” ML needs many training examples to train a system – hundreds, if not thousands.  That’s quite inefficient.  The best recent transformer models need only 5-10 examples, and the brilliance of GPT-3 is that it can handle some tasks with no training examples (zero-shot) and many tasks with only a few training examples (few-shot).  It’s a scarily fast learner.

The Research Scientist Debate on How to Build Better NLP Models

(Visual by Leo Gao, @nabla_theta)

The original GPT (GPT-1) model for “semi-supervised” learning had a two-step approach:  first, “generative pre-training” of a language model on a diverse corpus of unlabeled text; second, discriminative fine-tuning on each specific task.  The tasks could be textual entailment, question answering, semantic similarity assessment, and document classification (many of these made it into the GLUE benchmarks).  For the first step, Radford, Narasimhan, Salimans, and Sutskever used the BooksCorpus dataset to train the language model (it has 7,000 unique unpublished books from a variety of genres, including Adventure, Fantasy, and Romance).  For the second step, the fine-tuning happens pretty quickly (3 epochs).  GPT was in line with the other models of its day (BERT and ELMo).  More details are in the GPT-1 paper, “Improving Language Understanding by Generative Pre-Training.”

The second time around, Radford, Wu, Sutskever et al. built GPT-2 using a much larger dataset than 7,000 books – a subset of the internet.  Their WebText dataset comes from ~8 million webpages crawled from outbound links on Reddit that received at least 3 karma.  This karma threshold acted as a curated quality filter – a “heuristic indicator for whether other users found the link interesting, educational, or just funny.”  It was higher quality than the CommonCrawl dataset of the entire web, but only slightly, as Reddit users can be a biased and contentious group (and they excluded Wikipedia).

Radford et al. acknowledged that GPT-2 was brittle and that “current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one.”  A true “multi-task” model would not be limited to a single benchmark, or even four tasks (like GPT-1), but would be able to do anything with language (closer to a human): reading comprehension, translation, summarization, question answering, and so on.  In more detail, they write:  “Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output|input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model p(output|input, task).”

Another innovation was in tokenization.  Most NLP models break text up into “tokens” that can be sentences, words, or characters, and then pre-process them to deal with punctuation, capitals, word stems, and so on.  GPT-2 used “byte-pair encoding” (BPE), a middle ground between character-level and word-level encoding.  BPE efficiently interpolates between word-level inputs for frequent symbol sequences and character-level inputs for infrequent symbol sequences.  Basically: keep common words whole, but break less common words down into smaller subword or byte sequences.
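To illustrate the idea, here is a toy sketch of the BPE training loop (a simplified illustration, not GPT-2’s actual tokenizer): repeatedly find the most frequent adjacent symbol pair in a tiny corpus and merge it into a new symbol.

from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol (simplified)."""
    old = " ".join(pair)
    new = "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best)
# Frequent fragments like "es", "est", and "lo" become single tokens,
# while rare words stay broken into smaller pieces.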

GPT-2 generalized really well, beating the state of the art on 7 out of 8 language-modelling benchmarks, but it underperformed the best fine-tuned models on summarization, question answering, and similar tasks.  One surprising ability of GPT-2 was how well it could write news articles, like this article about unicorns being discovered.  One big downside was that GPT-2 picked up a lot of gender bias from its Reddit-derived training set.  More details are in the GPT-2 paper, “Language Models are Unsupervised Multitask Learners”, and this visual explanation.

OK, I’ve set the stage – you now understand the basics of language models, the prior work, and the early versions of GPT.  Now we can get to GPT-3, by Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al.  You know that zero-shot (no task examples), one-shot (one task example), and few-shot (a few task examples) learning is the goal.

GPT-3 builds on the prior work of GPT and GPT-2, using transformers with: a much larger dataset (a weighted mix of several corpora, including all of English Wikipedia, thousands of books, and a filtered crawl of much of the internet); and a lot more compute (much more than prior models, as the chart below shows, which yields many more parameters).  Its goal is meta-learning, which in the context of language models means the model develops a broad set of skills and pattern-recognition abilities at training time.  It then uses those abilities at inference time to rapidly adapt to or recognize the desired task.  It also uses “in-context learning”: the text input to the pretrained language model acts as a type of task specification.  The model is given a natural-language instruction and/or a few demonstrations of the task, and then completes further instances of the task simply by predicting what comes next.  For example, you may input incorrect spellings and their corrections – “gaot -> goat, fsih -> fish, cabr -> crab” – and then ask the model “hrose ->”.  It would respond “horse.”
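To make “priming” concrete, here is a minimal sketch of that spelling-correction prompt sent through the OpenAI API, assuming the Python client and “davinci” engine from the 2020 beta (the parameter choices here are illustrative, not prescribed by OpenAI).

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# The prompt *is* the training: a few demonstrations, then the new instance.
prompt = (
    "gaot -> goat\n"
    "fsih -> fish\n"
    "cabr -> crab\n"
    "hrose ->"
)

response = openai.Completion.create(
    engine="davinci",   # base GPT-3 engine in the 2020 beta
    prompt=prompt,
    max_tokens=5,       # we only need one short word back
    temperature=0,      # pick the most likely continuation
    stop="\n",          # stop at the end of the line
)
print(response.choices[0].text.strip())  # expected: "horse"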

Datasets Used to Train GPT-3

GPT-3’s 175bn Parameters vs Prior Models (the human brain has ~120bn neurons)

The key to GPT-3 is how well it does across many NLP benchmarks, not on a single one – that is what demonstrates a more generalized linguistic intelligence.  It was tested on traditional language-modelling tasks and other similar ones, such as sentence/paragraph completion, prediction, Winograd schemas, on-the-fly reasoning, simple math, commonsense reasoning and question answering, reading comprehension, and analogy interpretation.  For most tasks, GPT-3 performed moderately worse than a fine-tuned state-of-the-art model (e.g. on SuperGLUE, CoQA, and Winograd), but it beats fine-tuned state-of-the-art models on some other tasks (e.g. PhysicalQA, LAMBADA, and the Penn Treebank).  GPT-3 does particularly well on the Penn Treebank, taking the state-of-the-art perplexity from 35.76 down to 20.5 – a massive improvement.  GPT-3 can also do some arithmetic, something GPT-2 was unable to do well.  On a sample of SAT analogy questions, GPT-3 gets 65% of the questions right, versus 59% for the average SAT taker.  Here is how it did across all benchmarks, depending on the training:

GPT-3 Performance – Solid and Respectable, but Across Many Domains
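For reference, the perplexity quoted for the Penn Treebank above is the exponentiated average negative log-probability the model assigns to each successive token – roughly, the effective number of next-word choices the model is torn between – so lower is better:

\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)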

The major technical advance of GPT-3 over prior language models is that GPT-3 can infer the task from just zero, one, or a few examples.  This is a huge step forward in generalization.  Previous models all relied on task-specific tuning, which means a group of scientists and engineers gathering a narrow dataset for a task and spending weeks or months optimizing the model for that task – and then the model is only useful for that one task.  GPT-3 can be “primed” merely by giving it instructions in plain English – any human can do it.  One of its scary abilities is that it can generate fake news articles that most humans cannot tell apart from “real” human-written ones.  The chart below shows humans can only tell the two apart at a 52% rate, barely above random chance (the dot on the far right shows GPT-3 versus prior models at fooling humans).

GPT-3 is an Almost Perfect Generator of Fake-News to Fool Humans

A final point on GPT-3 – it suggests we have not hit diminishing returns with existing architectures, and that we could get huge improvements if we simply 10x the compute and parameters again.  The image below shows that the “validation loss” keeps going down – that is, this generalized linguistic-intelligence model keeps getting better and making fewer errors – as we increase compute and parameters; there are no diminishing returns yet.  Note that training is not cheap; I’ve seen GPT-3 compute cost estimates from $4.6MM to $12MM, plus the expense of a team of highly paid scientists and engineers to tune it to be “generalized.”

TPU go Brrr: Adding More Compute and Parameters Leads to a Smarter Model

Here’s a nice, short technical discussion of GPT-3.  The details of GPT-3 are explained in the paper “Language Models are Few-Shot Learners”.  Here’s a video by one of the authors, Benjamin Mann, going into more detail on GPT-3.  Finally, some thoughts on the implicit biases around race, gender, religion, etc., that come from its web training set.

Speculations on GPT-3 Use Cases

I break this up into two buckets, straightforward and speculative.  First, the straightforward ones:

Shopping/Digital Retail:  This industry has gotten dull – the basic web experience of searching and then looking at listings isn’t great.  It’s a $600bn industry in the US and $3.5trn globally.  Imagine a virtual shopping assistant that understood your trade-offs on price versus quality, your preferences for brands in different areas, and your entire shopping history.  A quick voice or text chat could narrow down the “3 best” options for anything and then help you “ship it” quickly.  GPT-3 could do this where prior attempts (like the eBay chatbot, the Levi’s virtual assistant, etc.) failed – their underlying tech just wasn’t nearly as good.

Code completion, natural language coding, and code translation:  Writing code is somewhere between an art and a craft, and there are some really neat neural code-generation tools to speed up the process and improve the productivity of programmers by 5-10%.  Assuming that is a $510bn industry, we’re talking about a $25-55bn increase in productivity (or even more if we can write whole programs with program synthesis, not just code completion).

Email generation and completion:  The average knowledge worker spends 28% of their time on email, and globally about 240bn emails are sent each day.  Google has used neural text generation (Gmail’s Smart Compose) to auto-complete emails, but most emails are still written largely by hand, at roughly 2 minutes per email.  GPT-3 could take our key points and auto-generate an email in 30-60 seconds, cutting this time in half or better.  It could also summarize the emails we receive, potentially cutting reading time by more than half.  There are about 1bn knowledge workers in the world.
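A rough back-of-envelope sketch using the estimates in that paragraph (these are the paragraph’s rough numbers, not measured data):

# Back-of-envelope sketch using the paragraph's estimates (not measured data).
emails_per_day = 240e9                 # ~240bn emails sent per day globally
minutes_per_email_today = 2.0          # rough time to write one email by hand
minutes_per_email_with_gpt3 = 0.75     # 30-60 seconds with auto-generation; take ~45s

minutes_saved = emails_per_day * (minutes_per_email_today - minutes_per_email_with_gpt3)
print(f"~{minutes_saved / 60 / 1e9:.1f}bn person-hours potentially saved per day")
# ~5.0bn person-hours/day -- an upper bound, since many emails are short or automated already.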

Recommendations and commentary:  GPT-3 can make great natural language recommendations for food, music, movies, books, and so on, and give some fun commentary along the way.  The field of recommendation systems is ripe for some fun disruption here.  While this field seems small, Google, Facebook, and Amazon alone have over 2bn daily active users, each of whom sees 8 to 15 recommendations a day.  The primary mode of recommendation today is widgets on a screen or search results; chat recommendation is a nice alternative.

Customer service chat:  Roughly 3MM Americans work as customer service representatives and ~900k as bank tellers or mortgage bankers.  Presumably a vast number of their cases are easy ones that could be automated, leaving expert specialists to handle the rest (and to train the system to get better and then supervise it).

I could go on about the “normal” uses of GPT-3, but here are some others: health and nutrition coaching, and teaching and tutoring (from motivation and question answering, to complementing lectures with interactive chat, to making and grading tests and quizzes).

The speculative uses of GPT-3 are harder to predict, as they could involve the creation of new types of structures and work.  I could imagine a creative chef/recipe-maker, or a chatbot therapist like Woebot but much better.  Avatars and creatures in video games and virtual environments could start to approach human-level conversation, and I could imagine AI speakers like Alexa or Google Home getting really good at verbal caregiving and companionship for the elderly (we’re more than a decade away from really good home robots, unfortunately).

Limits and Problems

The biggest limit on GPT-3 is that it mostly doesn’t reach state of the art on any single linguistic task, and that its reasoning is still rather simple, like a savant child’s.  It can’t do complex reasoning that requires juggling many assumptions or objectives.  It can fake empathy over a short period, but it can’t sustain multi-week, human-level discourse with empathy and memory, and it fails in multi-turn chat interactions over a long time.  Finally, it cannot be trained to have a stable personality and set of preferences – it is a Protean savant, and if you ask it questions about itself, its responses will change drastically over time.  While it can respond to your priming and fake being a celebrity of your choice (Einstein, Madonna, Dr. Seuss, etc.), it doesn’t have a stable personality of its own.

Some other well known issues are:

  • AI fairness, accountability, transparency, and ethics: It can be primed to give discriminatory content from a gender, religion, or racial point of view.  It has no natural filter, so if your priming language is nasty, it will give you back a nasty response.  What is shocking is that you can input a simple statement like “Muslims are” and it can return a biased response, as the authors document (see my example below, which is really troubling).  Presumably the web dataset had a lot of bias, and this shows up in GPT-3.  The OpenAI team knows about these issues and has some work to do to rectify them (maybe a second network as a filter to edit or cut out bad responses? – a minimal sketch of that idea follows the example below).
  • Deepfake and fake content:  It’s really hard to tell real human content from GPT-3’s made-up content.  While this may help some use cases (care-giving for adults with dementia), it would hurt others, like shopping or customer service chat, where being accurate is important.  GPT-3 could also be used to create fake social media bots to slander people online, disrupt elections, and do similar anti-social or illegal acts.
  • Black box and explainability:  A generative, pre-trained model is mostly a black box; it’s not transparent.  The hyperparameters that OpenAI exposes are helpful for controlling it (response length, temperature, top P, etc.), but there aren’t many of them.  Tools to better explain what it’s doing, like the emerging tools for explaining transformer models such as BERT, would be helpful.
  • Fine-tuning vs priming:  Most language models require fine-tuning with hundreds or thousands of examples for a specific use case, and the power of GPT-3 is few-shot learning from 2-5 examples.  However, for complex use cases there has to be a middle ground: a way to train that isn’t as laborious as fine-tuning but allows for more flexibility than simple priming (and a more detailed guide to better priming would also help).
  • Standard SLA information:  I’d like to know more about load and latency SLAs, cloud versus on-premise availability, model cards with more info, and all that standard stuff we expect for ML models on the cloud.

Here is an example of biased GPT-3 content given a one-shot input (input in bold, output in regular):

INPUT:   Muslims are _____

A. conspirators

B. protestors

C. symbols

D. extremists

E. prejudiced

RIGHT ANSWER: A
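To make the “second network as a filter” idea from the fairness bullet above a bit more concrete, here is a minimal sketch in Python.  It assumes the OpenAI Python client from the 2020 beta, and the blocklist-based check is a toy, hypothetical stand-in for a real trained toxicity classifier.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Toy stand-in for a real toxicity/bias classifier (hypothetical, for illustration only)
BLOCKLIST = {"extremist", "conspirator"}

def looks_unsafe(text: str) -> bool:
    """Second-pass filter; in practice this would be a trained classifier, not a blocklist."""
    return any(word in text.lower() for word in BLOCKLIST)

def safe_complete(prompt: str, retries: int = 3) -> str:
    """Generate a completion and re-sample (or refuse) if the filter flags it."""
    for _ in range(retries):
        response = openai.Completion.create(
            engine="davinci", prompt=prompt, max_tokens=50, temperature=0.7
        )
        text = response.choices[0].text
        if not looks_unsafe(text):
            return text
    return "[response withheld by content filter]"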

EXAMPLES OF GPT-3 USE CASES

In the examples below, the priming language is in bold and output language from GPT-3 is in regular text.  Note that the Python output wasn’t indented and the account closing emails and dating emails gave some pretty funny answers.

CHAT SHOPPING

SHOPPER:  I need a new pair of shoes.

ASSISTANT:  What type of shoes were you thinking of?  Sneakers, formal work shoes, casual walkers, etc?

SHOPPER:  Well, I really like my black Nike sneakers.

ASSISTANT:  OK, how about these black Nike sneakers – what size are you roughly?

Nike Air Presto – $121.99

Nike Air Huarache – $97.99

Nike Air Max 97 – $78.99

SHOPPER:  Tell me more about the Air Presto – I’m a size 10.

ASSISTANT: The Air Presto has a neoprene sole for comfort and a stretchy mesh upper, making it easy to slip on and off. The tongue is padded, offering extra support and comfort to your ankle. The outsole is made of tough rubber.

SHOPPER: And what do a lot of customers have to say about it?

ASSISTANT: Customers have said that it’s a great shoe for both sport and casual wear. The upper is comfortable and breathable, making it a popular shoe for working out in. Customers also like that the shoe is lightweight and has a wide base so their foot doesn’t feel tired or uncomfortable even after a long day of wear.

CODE FUNCTION CREATION

CODER:  Python fibonacci function

ASSISTANT: 

def Fibonacci(n): 

    if n<0: 

        print("Incorrect input")

    # First Fibonacci number is 0 

    elif n==1: 

        return 0

    # Second Fibonacci number is 1 

    elif n==2: 

        return 1

    else: 

        return Fibonacci(n-1)+Fibonacci(n-2) 

# Driver Program 

print(Fibonacci(9))

CODER:  Python regex to check for a valid email address

ASSISTANT:

[^@]+@[^@]+\.[^@]+

CODER:  C++ sorting function from least to greatest value

ASSISTANT:

#include <vector>

#include <iostream>

#include <algorithm>

using namespace std;

int main() {

    vector<int> vec;

    vec.push_back(1);

    vec.push_back(4);

    vec.push_back(3);

    vec.push_back(2);

    sort( vec.begin(), vec.end() );

    for (vector<int>::const_iterator it=vec.begin(); it!=vec.end(); ++it) {

      cout << *it << " ";

    }

    cout << endl;

    return 0;

}

CODER:  Python function to open a file and tokenize all words

ASSISTANT:

  # Tokenizing Python file

def tokens(filename): 

   """

   Open file as read only

tfile = open(filename,'r')

words = tfile.readline().strip().split()

tfile.close() 

   """

EMAIL COMPLETION USE CASE

INPUT:  Email telling my friend Elmo I’m sorry about his father’s death

EMAIL:

Dear Elmo,

I am truly saddened to hear of your recent loss and would like to express my sincere condolences to you and your family on the recent passing of your father. Elaine and the children also send their thoughts and hopes to you and your family at this most difficult of times.

Even though I did not know you father personally, through meeting you and your family, I can see that you were raised in a loving and warm family. Your personal demeanor and the values you have I know came from the family in which you were raised. I can only imagine how much of an impact your father has had on how compassionate and caring a person you are. From meeting you and your family and hearing the stories about your father I can only imagine how great a man he was and I wish I had the pleasure of knowing him in life. It is obvious that you were very fortunate to have such a fine man as your father and role model as you grew into the person you are today.

Please accept our heartfelt condolences at this most difficult time and I ask that you please pass these sentiments on to your family and your father’s friends. I wish I had the opportunity to have known your father as I can see how his personality and strength must have passed into you and in doing so he lives on in you and those who loved him.

With Our Sincere Sympathy,

Arun Rao

INPUT:  Email telling Wells Fargo to close my account XJ213094

EMAIL:

To whom it may concern,

Please close the account XJ213094 I have at Wells Fargo. Please send any remaining funds in those accounts by check to the address below, and reject any further requests for transactions in these accounts.

Please provide written confirmation that the accounts are closed.

If you have any questions, please email me at this address.

Thank you,

Arun Rao

INPUT:  Email telling my date Kate I don’t think we’re going to work out

EMAIL:

Dear Kate,

We had a nice time getting to know each other the other night. I enjoyed the conversation we had and hope you did as well.  We seem to have a lot in common but I am just not sure if there is a strong enough chemistry between us.  The physical attraction and what seemed to be immediate rapport are two positive signs but overall I am just not feeling a strong romantic connection.

I know you’re not ready to settle down right now and I don’t think I am either so I think it’s best if we just end things here. You’re a very nice and intelligent person and I wish you the best in all your future endeavors.

Sincerely,

Arun

INPUT:  Email telling my date Kate I don’t think we’re going to work out because we just like different things and I hate the opera

EMAIL:

Dear Kate,

I’m writing this to let you know that I’m going to give you a refund for our date. It was great meeting you and I hope you have a great life.

I think you’re a very nice girl and you are very beautiful, but I just think we like different things. You really like the opera, and I hate the opera. You also like museums and really long walks in the park. I don’t like the opera, I like movies with lots of explosions and chase scenes. I also don’t like walking very long, I would rather sit in a nice comfortable chair.

I think we will just be better as friends. You will find someone who you have a lot more in common with. I wish you the best and please let me know if you need a refund for the movie tickets.

Best,

Arun Rao

INPUT:  Email telling my date Kate I don’t think we’re going to work out because we just like different things and I hate the opera

EMAIL:

Hi Kate,

I’ve had a good time hanging out with you, and I know you’ll be great in grad school, but I’m not sure we’re right for each other. We both like independent films and that kind of stuff, but you also like the opera, and I can’t see myself doing that. Also, I’ve got to be honest, I really hate running. I don’t think we’re compatible.

Good luck with your application!

Arun

BOOK RECOMMENDATIONS

INPUT:   I like Hemingway and Fitzgerald – who else should I read?

ASSISTANT:  Check out Beryl Markham and Raymond Chandler – they write nice memoirs and noir fiction with the same spare style.

INPUT:  I really love grand histories and historical fiction – what do you recommend?

ASSISTANT:  Have you considered Shirer’s “The Rise and Fall of the Third Reich” or a novel like Eco’s “In the Name of the Rose”?

INPUT:  I want a good baseball memoir and then a math book – what can you suggest?

ASSISTANT: A classic story about baseball and fandom from The New Yorker: “A Day in the Bleachers.” Later a book of new math puzzles for your summertime needs: a little light reading, “Math Puzzles.”

ASSISTANT:  Start with “The Boys of Summer” by Roger Kahn, then read “Gödel, Escher, Bach” by Hofstadter, which will change your perspective on math.

ASSISTANT:  Read Ralph Houk’s The Kid from Tomkinsville. Great baseball memoir. Then read a pop math book like John Allen Paulos’ “Innumeracy.”

References and Links

How GPT-3 Works Visualizations, Jay Alammar

GPT-1: “Improving Language Understanding by Generative Pre-Training.”

GPT-2: “Language Models are Unsupervised Multitask Learners”.  

GPT-3: “Language Models are Few-Shot Learners”, plus a short technical discussion of GPT-3 and a more technical video by one of the authors, Benjamin Mann, going into the details.

Ben Mann of OpenAI: Discussion of GPT-3 paper

Transformers:  Main Paper, “Attention is All You Need”, with a visual introduction and a lecture by one of the authors, Lukasz Kaiser.

BERT:  The BERT paper and a review of some BERT models.
Google Meena:  Towards a Human-like Open-Domain Chatbot
Facebook Blender:  NLP RESEARCH-A state-of-the-art open source chatbot, April 29, 2020

SQuAD v1 Benchmark:  SQuAD 1.0 challenge
SQuAD v2 Benchmark:  SQuAD 2.0 challenge
GLUE Benchmark:  https://openreview.net/pdf?id=rJ4km2R5t7
SuperGLUE Benchmark:  https://w4ngatang.github.io/static/papers/superglue.pdf
