Introduction
This is a plain explanation of how OpenAI's GPT models developed, from GPT-3 in 2020 to the AI agents available today. It assumes no background knowledge.
The clearest way to follow this is problem by problem. Each model below fixed a specific limitation in the one before it. Most of them also introduced a new limitation, which the next model then addressed. Read in order, that chain explains how these systems work and why they are built the way they are.
Each part covers one model: what was new, what it could now do, and what it still could not. A small tracker after each part records the capabilities as they add up.
Part One: Predicting the Next Word
Model: GPT-3
GPT-3, released by OpenAI in 2020, was trained to do one thing: given some text, predict the word that comes next. It did this across a very large amount of text from the internet, repeated billions of times.
To predict the next word accurately, a model has to pick up a lot along the way, including grammar, facts, and common patterns of reasoning. This training stage is called pretraining, and it is what gives the model its broad knowledge.
The main change from the earlier GPT-2 was size. GPT-3 had about 175 billion internal values, called parameters, and was trained on far more text using far more computing power. Increasing these together is what people mean by scaling a model. At this larger size, GPT-3 could handle tasks it was never specifically built for, such as basic translation or arithmetic. Abilities that appear only once a model is large enough are called emergent abilities.
GPT-3 had a clear limit. It only knew how to continue text. Given a question, it might continue with more questions in the same style instead of answering. It also stated false information with the same confidence as true information, and it kept nothing from one request to the next.
What it could now do
- Picked up grammar, facts and patterns from a large amount of text
- Handled tasks, such as simple translation, that it was not built for
What it still could not do
- Did not reliably answer a question, it only continued text (no question-answer format)
- Stated false information as confidently as true information (hallucination)
- Kept no information between requests (statelessness)
Next problem: Get the model to actually do what it is asked, instead of just continuing the text.
Part Two: Following Instructions
Model: InstructGPT
To make the model follow instructions, OpenAI added a second training stage after pretraining, called fine-tuning. InstructGPT used two steps.
In the first step, people wrote good answers to a range of prompts, and the model was trained on these prompt-and-answer pairs. This taught it to respond to a request rather than continue it. This step is called supervised fine-tuning (SFT).
In the second step, the model produced several answers to a prompt, and people ranked them from best to worst. Those rankings were used to train a separate scoring system, and the model was then adjusted to produce answers that scored higher. This step is called RLHF, short for reinforcement learning from human feedback. It captures qualities that are hard to write out directly, such as being clear and not evasive.
The result was notable. A much smaller InstructGPT model (1.3 B parameters, 100x smaller) produced answers that people preferred over the original, far larger GPT-3 (175 B parameters). How well a model followed instructions mattered more than how large it was.
InstructGPT still answered one prompt at a time. It had no way to handle a back-and-forth conversation.
What it could now do
- Responded to instructions instead of continuing the text
- Gave clearer and more useful answers
What it still could not do
- Could not follow a conversation across multiple messages (statelessness)
- Still stated false information at times (hallucination)
Next problem: Let the model hold a conversation, keeping track of earlier messages.
Part Three: Holding a Conversation
Model: ChatGPT (running GPT-3.5)
ChatGPT launched in late 2022. It helps to be precise about the names: ChatGPT is the product, and the model behind that launch was GPT-3.5, an improved version of GPT-3 fine-tuned on conversation-style data.
The model itself still kept no memory between requests. The conversation was handled outside the model. Each time you send a message, the application sends the model the whole conversation so far. The model re-reads the full exchange every time and produces the next reply in context. The conversation feels continuous to the user, but the model is not remembering it. The application is resupplying it.
This outside layer, the part that resends the conversation and later connects the model to other systems, is called the harness. Many later improvements happen in the harness rather than inside the model. Conversation was worked around.
The change that made ChatGPT widely used was not a more capable model. It was a simple chat interface that anyone could use without technical knowledge.
ChatGPT only knew what was in its training data. It had no access to current or external information.
What it could now do
- Held a conversation across many messages
- Could be used by anyone through a simple chat interface
What it still could not do
- Knew only its training data, with no access to current information (hallucination)
- Still had no memory of its own (no native memory)
Next problem: Give the model a way to reach information beyond its training data.
Part Four: Images and Tools
Model: GPT-4 and GPT-4o
GPT-4, released in 2023, added two main capabilities.
The first was image input. GPT-4 could take an image as part of the prompt and describe or analyse it, not only process text. And so, it's multimodal. Note - GPT-4 stitched a separate vision encoder onto a text model, whereas GPT-4o was trained from scratch as a single, natively multimodal network - making 4o faster, cost effective, and accurate.
The second was tool use. Through the harness, the model could call external tools such as web search, a code runner, or document lookup. The process is straightforward: the model outputs a request to use a tool, the harness runs that tool, and the result is added to the context for the model's next response. This let the model retrieve current information instead of relying only on its training data.
With GPT-4, OpenAI is reported to have brought Mixture-of-Experts (MoE) to commercial scale, routing each token to a handful of expert networks to get massive capacity at a fraction of the inference cost, a design that quickly became the industry standard
GPT-4 also had a larger context window, which is the amount of text it can take in at once (from 16k with GPT 3.5 to 128k in GPT-4-turbo). Later versions increased this further.
One limit remained. GPT-4 still produced its answer in a single pass, straight away. On hard problems that need several steps, it would often commit to an answer too quickly and get it wrong.
What it could now do
- Accepted images as input, not only text (multimodal capability)
- Used tools such as web search to get current information (tool execution triggering capability)
What it still could not do
- Produced an answer in one pass, with no working-out step (no reasoning)
- Still made confident mistakes on hard, multi-step problems (hallucination)
Next problem: Change how the answer is produced, so the model can work through a problem before committing.
Part Five: Reasoning Before Answering
Model: o1
o1, released by OpenAI in late 2024, is a reasoning model. The change is in how it produces an answer.
Earlier models generated the answer directly. o1 first generates a long sequence of intermediate steps, working through the problem, before giving its final answer. These steps are not a separate system. They are simply more text the model writes, which then becomes part of what it reads when producing the final answer. Writing more steps means the model does more computation before committing, which improves results on hard problems. In everyday terms, this is closer to working a problem out on paper than answering from memory.
o1 was trained for this using reinforcement learning on problems where the answer can be checked automatically, such as maths and code. Reasoning paths that reached correct answers were reinforced (reinfrocement learning on reasoning tokens), so the model learned which approaches tend to work.
This came at a cost. Generating all those intermediate steps makes o1 slower and more expensive to run than earlier models. And on questions with no checkable correct answer, it could still be wrong.
What it could now do
- Worked through problems step by step before answering, and so, much stronger on hard maths, coding, and logic (reasoning)
What it still could not do
- Was slower and more expensive to run
- Could still be wrong where the answer cannot be checked
Next problem: Combine reasoning with tools, so the model can carry out a whole task rather than answer one question.
Part Six: Acting on a Task
Model: o3, GPT-5 and agents
After o1, OpenAI released stronger reasoning models, including o3, and later GPT-5. Two developments matter most in this period.
The first is agents. An agent is a reasoning model placed in a loop with tools. It plans a step, calls a tool, reads the result, plans the next step, and repeats this until a task is finished. This can run for many steps. OpenAI's agent-style products, such as its deep research feature and its ChatGPT agent, work this way. The model is no longer only answering questions. It is carrying out tasks.
The second is routing. Earlier, a user had to choose between a fast model and a slower reasoning model. With routing, the system looks at the question and decides for itself how much reasoning to apply: a quick reply for a simple question, more steps for a hard one.
Several problems are still open. These models still state false information at times. They still have no built-in or native memory; any memory feature is supplied by the harness, stored outside the model and added back into the context. And allowing a model to take actions on its own raises questions of reliability and oversight that are not fully solved.
What it could now do
- Carries out multi-step tasks, not only single answers
- Decides for itself how much reasoning a question needs
What it still could not do
- Still states false information at times (hallucination)
- Still has no built-in memory of its own (no native memory)
- Acting on its own raises reliability and oversight questions
Summary: The Sequence in Full
Across all six steps, the pattern is consistent. Each model fixed a major limitation in the one before it: GPT-3 could not follow instructions, so InstructGPT was trained to; InstructGPT could not converse, so ChatGPT's harness handled conversation; ChatGPT could not reach new information, so GPT-4 added tools; GPT-4 answered too quickly, so o1 added reasoning; o1 could only answer, so agents were built to act, with routing deciding how much reasoning each task needs.
| Model | What it added | What it still lacked |
|---|---|---|
| GPT-3 | Broad language ability, from pretraining | Could not follow instructions |
| InstructGPT | Instruction-following, from fine-tuning | Could not hold a conversation |
| ChatGPT (GPT-3.5) | Conversation, handled by the harness | No access to current information |
| GPT-4 / GPT-4o | Image input and tool use | Answered without working through steps |
| o1 | Step-by-step reasoning | Slow, and only answered, did not act |
| o3, GPT-5, agents | Multi-step tasks, plus routing | Memory, factual accuracy, oversight |
The problems that remain are visible in the same list: reliable memory, factual accuracy, and trustworthy autonomy. These are the focus of current work, which is why the sequence is not finished.