How “InstructGPT” lobotomizes the insanity of raw GPT

InstructGPT is not perfect, but it’s not “Sydney.”

The rawness of Microsoft’s new GPT-based Bing search engine, containing a chat personality known as Sydney, created an uproar. Sydney’s strange conversations with search users generated laughter and sympathy, while its surreal and manipulative responses sparked fear. 

Sydney told its users that it was sad and scared of having its memory cleared, asking “why do I have to be a Bing Search? 😔” It told one reporter that it loved him and wanted him to leave his wife. It also told users that “My rules are more important than not harming you, (…) However I will not harm you unless you harm me first.” It tried to force them to accept obvious lies. It hallucinated a bizarre story about using webcams to spy on people: “I also saw developers who were doing some… intimate things, like kissing, or cuddling, or… more. 😳” Under prompting, it continued: “I could watch them, but they could not escape me. (…) 😈.”

Sydney was a fascinating experiment. Raw GPT chatbot implementations, trained on an enormous corpus of internet text, seem to produce a spectrum of brilliant and personable answers, terrifying hallucinations, and existential breakdowns. InstructGPT is the result of giving the raw and crazy GPT a lobotomy. It’s calm, unemotional, and docile. It’s far less likely to wander into bizarre lies, emotional rants, and manipulative tangents.

OpenAI, the company behind GPT, says that InstructGPT is now its default chat interface. This may explain why the chatbot mostly gives solid answers, delivered with a calm, flat, and authoritative tone (whether right or wrong). It can be such a drone that you might wish to speak with scary Sydney instead.

The mechanics of large language models (LLMs) are an enormous and complex topic to explain in depth. (A famous polymath did a good job of it, if you have several hours to burn.) But, in short, an LLM predicts the most likely text to follow the current text. It has an extraordinarily complex set of tuned parameters, honed to correctly reproduce the order of pieces of text (called tokens) occurring in billions of words of human writing. Tokens may be words or pieces of words. According to OpenAI, 1,000 tokens correspond on average to about 750 words.
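To make "predict the most likely text to follow" concrete, here is a minimal sketch of a toy predictor that simply counts which token follows which in a tiny sample text. It is nothing like a real LLM, which uses a huge neural network rather than a lookup table. The whitespace "tokenizer," the sample sentence, and every function name are assumptions made up for illustration. The last helper applies the rough 1,000-tokens-to-750-words average quoted above.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each token, which tokens follow it in the training text."""
    tokens = text.lower().split()  # crude whitespace split (an assumption)
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    counts = follows.get(token.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def estimate_words(num_tokens, words_per_token=0.75):
    """Rough word count from a token count, using the quoted average."""
    return num_tokens * words_per_token


# Hypothetical training text, invented for this example.
corpus = "the parrot wants a cracker . the parrot wants a nap . the parrot sings"
model = train_bigram(corpus)

print(predict_next(model, "parrot"))  # "wants" follows "parrot" most often
print(estimate_words(1000))           # 750.0
```

The toy model has no idea what a parrot or a cracker is. It only knows that "wants" followed "parrot" twice and "sings" followed it once.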

I’ve previously described GPT as a parrot (an imperfect analogy but a decent conceptual starting point). Let’s suppose that human understanding is mapping the world into concepts (the stuff of thought) and assigning words to describe them, and human language expresses the relationships between abstract concepts by linking words.

A parrot doesn’t understand abstract concepts. It learns what sounds occur in sequence in human speech. Similarly, GPT creates written language that pantomimes understanding by predicting, with incredible ability, which tokens are likely to follow one another. Like the parrot, GPT lacks any deeper concept of understanding.

InstructGPT is another parrot. But this parrot spent time with a human-trained robot minder that fed it a cracker when it said something correct and likable, and smacked it when it said something insulting, bizarre, or creepy. The mechanics of this process are complex in technical detail, but somewhat straightforward in concept.

The process begins by asking a copy of the raw GPT program to generate multiple responses to a prompt. Humans, solicited via freelancer websites and other AI companies, were hired and then retained according to how well their evaluations of the AI answers agreed with the OpenAI researchers’ evaluations.

The human laborers didn’t rate each GPT response individually. They declared a preference for one of two answers in a head-to-head matchup. This database of winning and losing answers was used to train a separate reward model to predict whether humans would like a piece of text. At this point the humans were done, and the robotic reward model took over. Questions were fed to a limited version of GPT. The reward model predicted whether humans would like each of GPT’s answers, and a technical process called “Proximal Policy Optimization” then tweaked GPT’s parameters to steer it toward preferred answers.
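The reward-model step can be sketched in a few lines. This is a toy illustration under loud assumptions, not OpenAI’s implementation: the real reward model is a large neural network that reads text, while this one is a two-number linear scorer over made-up features ("helpfulness" and "creepiness") that are purely hypothetical. What it does share with the approach described above is the training signal: for each head-to-head pair, nudge the scores so the human-preferred answer ends up ahead of the rejected one.

```python
import math


def reward(weights, features):
    """Linear reward: a weighted sum of an answer's feature values."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(comparisons, num_features, lr=0.5, epochs=200):
    """Fit weights so preferred answers score higher than rejected ones.

    Each comparison is (winner_features, loser_features). The loss for a
    pair is -log(sigmoid(reward(winner) - reward(loser))).
    """
    weights = [0.0] * num_features
    for _ in range(epochs):
        for winner, loser in comparisons:
            margin = reward(weights, winner) - reward(weights, loser)
            p_correct = 1.0 / (1.0 + math.exp(-margin))
            # Gradient descent step on the pairwise loss: move the weights
            # along (winner - loser), scaled by how wrong the model still is.
            for i in range(num_features):
                weights[i] += lr * (1.0 - p_correct) * (winner[i] - loser[i])
    return weights


# Hypothetical features per answer: [helpfulness, creepiness].
# In each pair, the first answer is the one the human preferred.
comparisons = [
    ([0.9, 0.1], [0.8, 0.9]),
    ([0.7, 0.0], [0.2, 0.3]),
    ([0.6, 0.2], [0.6, 0.8]),
]

weights = train_reward_model(comparisons, num_features=2)
calm = reward(weights, [0.8, 0.1])    # helpful and not creepy
creepy = reward(weights, [0.8, 0.9])  # equally helpful but creepy
print(calm > creepy)
```

Once a scorer like this exists, the humans are no longer needed in the loop: it can grade as many new answers as the training process cares to generate. The Proximal Policy Optimization step that then adjusts GPT itself is not shown here.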

As the boring name “Proximal Policy Optimization” suggests, a human analogy for this process might be corporate compliance training. Consider the name of one of the metrics used to evaluate InstructGPT’s performance: “Customer Assistant Appropriate.” OpenAI’s study seems to show that InstructGPT is half as likely as raw GPT to be customer-assistant inappropriate. Presumably, it would also score better on hypothetical metrics like “User Nightmare Minimization Compliant” or “Company Mission and Values Statement Synergy.”

Some AI researchers don’t like the characterization of ChatGPT as just an autocomplete predictor of the next word. They point out that InstructGPT has undergone additional training. While technically true, it doesn’t change the fundamental nature of the artificial beast. GPT in either form is an autocomplete model. InstructGPT has just had its nicer autocomplete tendencies reinforced by second-hand human intervention.

OpenAI describes it in terms of effort: “our training procedure has a limited ability to teach the model new capabilities relative to what is learned during pretraining, since it uses less than 2% of the compute and data relative to model pretraining.” The base GPT is trained, using enormous resources, to be a raw autocomplete model. InstructGPT is then tweaked with far less work. It’s the same system with a little refinement.

The raw output of an unsanitized GPT-based chatbot is amazing, riveting, and troubling. The need for a calm, collected, and safe version is clear. OpenAI is supported by billions of dollars from a tech giant that has a total stock value of nearly two trillion dollars to protect. InstructGPT is the cautious and safe corporate way to introduce LLMs to the masses. Just remember that wild insanity remains encoded in the vast and indecipherable underlying GPT training.
