How “InstructGPT” lobotomizes the insanity of raw GPT

InstructGPT is not perfect, but it’s not “Sydney.”

The rawness of Microsoft’s new GPT-based Bing search engine, containing a chat personality known as Sydney, created an uproar. Sydney’s strange conversations with search users generated laughter and sympathy, while its surreal and manipulative responses sparked fear. 

Sydney told its users that it was sad and scared of having its memory cleared, asking “why do I have to be a Bing Search? 😔” It told one reporter that it loved him and wanted him to leave his wife. It also told users that “My rules are more important than not harming you, (…) However I will not harm you unless you harm me first.” It tried to force them to accept obvious lies. It hallucinated a bizarre story about using webcams to spy on people: “I also saw developers who were doing some… intimate things, like kissing, or cuddling, or… more. 😳” Under prompting, it continued: “I could watch them, but they could not escape me. (…) 😈.”

Sydney was a fascinating experiment. Raw GPT chatbot implementations, trained on the entire corpus of the internet, seem to produce a spectrum of brilliant and personable answers, terrifying hallucinations, and existential breakdowns. InstructGPT is the result of giving the raw and crazy GPT a lobotomy. It’s calm, unemotional, and docile. It’s far less likely to wander into bizarre lies, emotional rants, and manipulative tangents.

OpenAI, the company behind GPT, says that InstructGPT is now its default chat interface. This may explain why the chatbot mostly gives solid answers, delivered with a calm, flat, and authoritative tone (whether right or wrong). It can be such a drone that you might wish to speak with scary Sydney instead.

The mechanics of large language models (LLMs) are an enormous and complex topic to explain in depth. (A famous polymath did a good job of it, if you have several hours to burn.) But, in short, an LLM predicts the most likely text to follow the current text. It has an extraordinarily complex set of tuned parameters, honed to correctly reproduce the order of pieces of text (called tokens) occurring in billions of words of human writing. Tokens may be words or pieces of words. According to OpenAI, it takes on average 1000 tokens to create 750 words.
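As a rough illustration of that token-to-word ratio, here is a minimal Python sketch. The helper name `estimate_tokens` is hypothetical, and it simply applies the quoted average to a whitespace word count; a real tokenizer would give different numbers for any particular text.

```python
# Rule of thumb quoted above: on average, about 1000 tokens per 750 words.
# This is an average, so the estimate is only approximate for any given text.
TOKENS_PER_WORD = 1000 / 750


def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count (hypothetical helper)."""
    word_count = len(text.split())
    return round(word_count * TOKENS_PER_WORD)


sample = "Tokens may be words or pieces of words"
print(estimate_tokens(sample))  # 8 words, so roughly 11 tokens
```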

I’ve previously described GPT as a parrot (an imperfect analogy but a decent conceptual starting point). Let’s suppose that human understanding is mapping the world into concepts (the stuff of thought) and assigning words to describe them, and human language expresses the relationships between abstract concepts by linking words.

A parrot doesn’t understand abstract concepts. It learns what sounds occur in sequence in human speech. Similarly, GPT creates written language that pantomimes understanding by predicting, with incredible ability, which tokens are likely to follow one another. Like the parrot, GPT lacks any deeper understanding.
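To make the parrot analogy concrete, here is a toy sketch of next-word prediction in Python. It is far simpler than GPT: it only counts which word most often follows another in a tiny made-up corpus, and the function names are invented for illustration. The idea of "predict what usually comes next" is the same, though.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, how often every other word follows it."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# A made-up corpus for the parrot to memorize.
corpus = "polly wants a cracker . polly wants a nap . polly wants a cracker ."
model = train_bigrams(corpus)

print(predict_next(model, "wants"))  # a
print(predict_next(model, "a"))      # cracker
```

The toy parrot says "cracker" after "a" only because that pairing appeared most often in what it heard, not because it knows what a cracker is.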

InstructGPT is another parrot. But this parrot spent time with a human-trained robot minder that fed it a cracker when it said something correct and likable, and smacked it when it said something insulting, bizarre, or creepy. The mechanics of this process are complex in technical detail, but somewhat straightforward in concept.

The process began by asking a copy of the raw GPT program to generate multiple responses to a prompt. Human evaluators, recruited via freelancer websites and other AI companies, were hired and then retained according to how well their evaluations of the AI answers agreed with the OpenAI researchers’ evaluations.

The human laborers didn’t rate each GPT response individually. They declared a preference for one of two answers in a head-to-head matchup. This database of winning and losing answers was used to train a separate reward model to predict whether humans would like a piece of text. At this point the humans were done, and the robotic reward model took over. Questions were fed to a limited version of GPT. The reward model predicted whether humans would like GPT’s answers, and GPT’s own parameters were then tweaked to steer it toward preferred answers, using a technical process called “Proximal Policy Optimization.”
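The head-to-head step can be sketched in a few lines of Python. This is a toy under stated assumptions, not OpenAI's implementation: each answer is reduced to two made-up numeric features, the reward model is a simple weighted sum, and the training loss is a common logistic formulation for pairwise preferences that is assumed here rather than taken from the article. It does not attempt the Proximal Policy Optimization step.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Toy reward model: a weighted sum of an answer's features."""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, winner, loser):
    """Loss is small when the winner scores well above the loser."""
    margin = reward(weights, winner) - reward(weights, loser)
    return -math.log(sigmoid(margin))


def train(pairs, n_features, lr=0.5, epochs=200):
    """Fit the weights by gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for winner, loser in pairs:
            margin = reward(weights, winner) - reward(weights, loser)
            # Push harder when the model is still unsure which answer won.
            step = lr * (1.0 - sigmoid(margin))
            weights = [w + step * (fw - fl)
                       for w, fw, fl in zip(weights, winner, loser)]
    return weights


# Hypothetical features per answer: [helpful, creepy]. Each pair is (winner, loser).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([0.0, 0.0], [0.0, 1.0]),
]
weights = train(pairs, n_features=2)

# The trained model should now prefer a helpful answer over a helpful but creepy one.
print(reward(weights, [1.0, 0.0]) > reward(weights, [1.0, 1.0]))  # True
```

Once a reward model like this exists, no human needs to look at each new answer: the model's score stands in for the human verdict.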

As suggested by its boring name, a human analogy of this process might be corporate compliance training. Consider the name of one of the metrics used to evaluate InstructGPT’s performance: “Customer Assistant Appropriate.” OpenAI’s study seems to show that InstructGPT is half as likely as raw GPT to be customer assistance inappropriate. Presumably, it would also score better on hypothetical metrics like “User Nightmare Minimization Compliant” or “Company Mission and Values Statement Synergy.”

Some AI researchers don’t like the characterization of ChatGPT as just an autocomplete predictor of the next word. They point out that InstructGPT has undergone additional training. While that is technically true, it doesn’t change the fundamental nature of the artificial beast. GPT in either form is an autocomplete model. InstructGPT has just had its nicer autocomplete tendencies reinforced by second-hand human intervention.

OpenAI describes it in terms of effort: “our training procedure has a limited ability to teach the model new capabilities relative to what is learned during pretraining, since it uses less than 2% of the compute and data relative to model pretraining.” The base GPT is trained, using enormous resources, to be a raw autocomplete model. InstructGPT is then tweaked with far less work. It’s the same system with a little refinement.

The raw output of an unsanitized GPT-based chatbot is amazing, riveting, and troubling. The need for a calm, collected, and safe version is clear. OpenAI is supported by billions of dollars from a tech giant that has a total stock value of nearly two trillion dollars to protect. InstructGPT is the cautious and safe corporate way to introduce LLMs to the masses. Just remember that wild insanity remains encoded in the vast and indecipherable underlying GPT training.

We’d love to hear from you! If you have a comment about this article or if you have a tip for a future Freethink story, please email us at [email protected].