Generative language models like ChatGPT (and now GPT-4) have astonished many of us with their remarkable ability to answer almost any question in confident, conversational language, in a tiny fraction of the time it would take even the fastest human to write a response. Yet people almost immediately noticed something funny: ChatGPT is frequently, comically wrong. It makes obvious logical errors. It can't begin to do the simplest math. Even worse, when challenged it doubles down and insists it's right. For example, my SingleStone colleague Jonathan Kuhn prompted an early version of ChatGPT with this question:
what is 3.1415926536 * 1.4142135624 ?
ChatGPT initially gave the wrong answer, and after repeated “scolding” eventually produced this:
I apologize if there has been any confusion. I can assure you that my previous answers are correct. Here is the arithmetic to show the correct result:

3.1415926536 * 1.4142135624 = 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 + 3.1415926536 = 44.41421356237309
I hope this helps to clarify the correct result. Please let me know if you have any other questions.
This resembles nothing so much as a student who has heard some math concepts but has no idea how to apply them and just wings it. Notably, later versions of ChatGPT, as well as GPT-4, do appear to be able to do the math. But they do it by handing off the request to a math specialist.
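For the record, the correct answer is easy to check. A quick sanity check in Python shows both the right product and why the repeated-addition "proof" fails (repeated addition is only valid for integer multipliers):

```python
# The product ChatGPT was asked for, computed directly.
a = 3.1415926536   # roughly pi
b = 1.4142135624   # roughly sqrt(2)

print(a * b)  # about 4.4428829382 -- nowhere near ChatGPT's 44.414...

# ChatGPT instead summed twenty copies of a, which gives a third,
# also-different number -- and no integer number of copies of a
# can sum to 44.41421356237309 anyway.
print(sum(a for _ in range(20)))  # about 62.83
```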
How generative language models work
Some weren’t surprised. As Gary Marcus has pointed out, generative language models include no concept of truth. They work by predicting what word most likely comes next, based on what’s been written so far. They derive this likelihood from all the examples of how words are used in their training data. But the likelihood of a word appearing in context tells us little about whether the usage creates a statement that is factually correct or logically sound. The models are just winging it. Compounding this problem, generative language models (and ML models more generally) perpetuate biases in their training data.
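To make "predicting what word most likely comes next" concrete, here is a deliberately tiny sketch: a bigram counter that always picks the most frequent next word. Real models use neural networks trained on vast corpora rather than raw counts, but the core move (choose the likeliest continuation, with no notion of truth) is the same:

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which in a tiny
# corpus, then always return the most frequent successor.
corpus = "the cat sat on the mat the cat ate the fish".split()

successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return successors[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' -- the most frequent follower of 'the'
```

Notice that the predictor has no idea whether "the cat sat" is true of any actual cat; it only knows what tends to follow what.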
The usefulness of generative AI despite its limitations
So how is a confident liar, unaware of its own fallibility, useful? Aren't there risks in creating truthful-sounding nonsense? Absolutely, and we'd best approach these tools with caution. Nevertheless, generative AI tools can be useful even without solving this fundamental limitation.
A great deal of the risk comes from mismatched expectations. We compare AI to people, and people do several things simultaneously as they write. We correct spelling and grammar, monitor style, follow chains of logical reasoning, and check facts. We’re so good at doing all of these things at the same time that we understand them to be just part of “writing.”
Yet people don't always write in such a self-contained way. Magazines employ professional writers, but they also employ other people who are individually responsible for different parts of the process. The New Yorker is famous for its highly rigorous fact-checkers. Do writers who work for The New Yorker write stylish, grammatically correct prose that is also mostly factually accurate? Probably! But that's not nearly good enough.
Protein design is a great example of how to use generative AI
For about 12 years before I moved into software, I did scientific research on the problem of protein design. Proteins do almost all of the work of biology, including digesting food, making muscles contract, and sensing changes in the environment. These tiny machines are long chains of smaller molecules called amino acids, arranged sort of like a very long charm bracelet. We represent the different "charms" with letters, and a protein's sequence is about the same length as an English paragraph with all the spaces and punctuation removed. Proteins even have features that are somewhat analogous to words and sentences, although no human can directly read a protein sequence or understand its grammar or logic.
The sequence of the amino acids determines what a protein does. Here is a sequence of a protein that detects blue light and tells a plant to grow toward it:
If we find a new protein that has the right kind of similarities to the above sequence, we can infer that it also senses light and directs behavior. Though we would like to be able to generate completely new proteins that do other useful things by composing novel sequences, we really don’t understand how to do it.
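As a simple illustration of what "the right kind of similarities" can mean in the crudest case, here is a toy percent-identity calculation. The sequences below are invented fragments for demonstration only, not real proteins, and real comparisons use alignment tools such as BLAST that handle insertions, deletions, and similarity scoring:

```python
# Toy sequence comparison: percent identity between two equal-length,
# pre-aligned sequences. Purely illustrative; real protein comparison
# is far more sophisticated.
def percent_identity(seq_a, seq_b):
    """Fraction of aligned positions where the letters match, as a %."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / max(len(seq_a), len(seq_b))

known_light_sensor = "MKTAYIAKQR"  # hypothetical fragment
new_protein = "MKTAYLAKQR"         # differs at one position

print(percent_identity(known_light_sensor, new_protein))  # 90.0
```

A high enough identity to a protein of known function is the usual first hint that the new protein does something similar.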
If you haven't worked in the field yourself, you probably have a hard time appreciating just how hard a problem this is. Here's an analogy: suppose the only way you could write was to give a bunch of monkeys typewriters and wait around until they produced a paragraph that said more or less what you wanted to say.
Intolerable, right? But this is more or less what designing new proteins has been like.
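The monkey analogy is apt because the search space is astronomical: with 20 possible amino acids at every position, even a short protein has vastly more candidate sequences than there are atoms in the observable universe (roughly 10 to the 80th). A quick back-of-the-envelope calculation:

```python
# Size of protein sequence space: 20 possible amino acids per position.
AMINO_ACIDS = 20
protein_length = 100  # a short protein; many are several times longer

search_space = AMINO_ACIDS ** protein_length
print(len(str(search_space)))  # 131 digits -- about 1.3e130 sequences
```

Random search over a space that size, whether by monkeys or by mutating molecules in a test tube, is hopeless without something to steer it.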
Now imagine that someone gave you a computer program that could write a topical, reasonably styled, and grammatically correct paragraph in an hour or so, with the only catch being some factual and logical errors. That program would be a godsend. Well, that's what researchers at the UW Institute for Protein Design and Salesforce (of all places) have done for proteins. They used generative AI to create new proteins that are basically correct but don't fully work as intended. The designs still need several rounds of empirical validation and error correction. This isn't seen as a big deal because scientists are used to working this way. And so are writers at highbrow magazines.
Leverage productivity gains, but recognize limitations
This is not to say that the limitations of current generative AIs aren’t real. If you don’t realize that the mistakes and biases are there, then you’re in for some serious problems. But the danger comes from treating the AI as if it were a human writer who does all of the parts of writing simultaneously and approximately equally well.
We can instead think of the AI as a sort of extreme specialist that can't do some parts at all but does other parts orders of magnitude faster than a human could do them. For most of us who are subject matter experts and not professional writers, writing stylish, grammatically correct prose is…slow. What if we could offload just that part of the writing process onto software that is incredibly fast and cheap?
That framing creates a way to get real value out of even unreliable AIs. The tool isn't perfect, but it is fast and cheap. Take time to consider which tasks generative AI can accelerate, and use it for those. The question then becomes: how do we reorganize work done by human teams to take advantage of these limited but real productivity gains?