Sunday, June 14, 2026

AI Intel Report

LLM Context Window Explained: Sizes by Model (2026)

An LLM context window is the model's working memory: the maximum number of tokens it can read and generate in a single request. Here is what that means in 2026 and how context window sizes compare across the major models.

[Illustration: a library reading desk with a single tall stack of bound documents lit by a focused lamp. Credit: AI Intel Report]

In short

An LLM context window is the maximum amount of text, measured in tokens, that a large language model can read and generate in a single request. It is the model's working memory — covering your prompt, any documents, the chat history, and the response — and anything outside it is invisible to the model.

If you have ever pasted a long document into a chatbot and watched it forget the opening by the time it reaches the end, you have run into the context window. In 2026 it is one of the most quoted specs on every model launch, and also one of the most misunderstood. Vendors race to advertise bigger numbers, but the size of the window and how well a model actually uses it are two different things. This explainer defines the term plainly, converts the jargon into words and pages, compares context window sizes by model, and explains why a bigger window is not automatically a better one.

What is an LLM context window?

The context window is, in Anthropic's words, "all the text a language model can reference when generating a response, including the response itself" — a working memory distinct from the vast corpus the model was trained on. Training knowledge is permanently encoded in the model's weights; the context window is the live information you hand it at run time. Everything you put in a single request shares that budget: the hidden system instructions, your question, any files or retrieved passages, the prior turns of the conversation, and the tokens the model spends writing its answer. When the total would exceed the window, something has to give — the oldest content gets dropped, summarized, or selectively retrieved, because the model literally cannot see anything beyond the window's edge.
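To make the shared budget concrete, here is a minimal sketch of the bookkeeping. The class name, field names, and every token count below are hypothetical, chosen only for illustration; real counts would come from the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class RequestBudget:
    """Token counts for the parts of one request (all values are illustrative)."""
    window: int          # the model's maximum context window
    system: int = 0      # hidden system instructions
    history: int = 0     # prior turns of the conversation
    documents: int = 0   # files or retrieved passages
    question: int = 0    # the user's current message
    max_output: int = 0  # tokens reserved for the model's answer

    def used(self) -> int:
        # Everything in the request draws on the same budget, output included.
        return (self.system + self.history + self.documents
                + self.question + self.max_output)

    def remaining(self) -> int:
        # Negative means the request would overflow the window.
        return self.window - self.used()

    def fits(self) -> bool:
        return self.remaining() >= 0


# Hypothetical request against a 128,000-token window.
budget = RequestBudget(window=128_000, system=1_500, history=40_000,
                       documents=90_000, question=500, max_output=4_000)
print(budget.used())       # 136000
print(budget.remaining())  # -8000
print(budget.fits())       # False
```

In this made-up case the request is 8,000 tokens over, so something has to be dropped, summarized, or retrieved more selectively before it can be sent.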

How are context windows measured — tokens vs words?

Context windows are measured in tokens, not words. A token is the unit the tokenizer produces: usually a short word or a fragment of a longer one, with punctuation and digits often split into tokens of their own. A useful rule of thumb for English is that one token is about four characters, or roughly three-quarters of a word, so 1,000 tokens is about 750 words and a one-million-token window holds on the order of 750,000 words. Google offers vivid anchors for that scale: one million tokens is roughly "8 average length English novels" or "50,000 lines of code," per its long-context documentation. Code, tables, and non-English text tokenize less efficiently, meaning they need more tokens for the same amount of content, so always estimate in tokens rather than trusting a raw word count when you are deciding whether a document will fit.
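The rule of thumb translates into a few lines of arithmetic. This is a rough sketch that assumes English prose and uses only the two ratios quoted above; the function names are hypothetical, and an exact count requires the tokenizer of the specific model you are calling.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # roughly three-quarters of a word per token


def estimate_tokens_from_chars(text: str) -> int:
    """Rough token estimate from character count, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate from a word count, rounded up."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def words_that_fit(window_tokens: int) -> int:
    """Approximate number of English words a window of this size can hold."""
    return int(window_tokens * WORDS_PER_TOKEN)


print(words_that_fit(1_000))            # 750
print(words_that_fit(1_000_000))        # 750000
print(estimate_tokens_from_words(750))  # 1000
```

Treat the result as an estimate with a safety margin, not a guarantee: code, tables, and non-English text will usually come out higher than this heuristic suggests.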

What are the context window sizes by model in 2026?

By mid-2026 the leading commercial models converge around the one-million-token mark, while open-weight models span a far wider range. The table below lists the maximum advertised context window for several widely used models, drawn from each vendor's own documentation.

Context window sizes by model (maximum advertised, mid-2026)
| Model | Developer | Context window (tokens) | Type |
| --- | --- | --- | --- |
| Llama 4 Scout | Meta | 10,000,000 | Open-weight |
| GPT-5.5 | OpenAI | 1,050,000 | Proprietary |
| Gemini 2.5 Pro | Google | 1,000,000 | Proprietary |
| Claude Sonnet 4.6 | Anthropic | 1,000,000 | Proprietary |
| DeepSeek-V3 | DeepSeek | 128,000 | Open-weight |

The specifics: OpenAI's GPT-5.5 model page lists a 1,050,000-token window with up to 128,000 output tokens; Anthropic announced a one-million-token window for Claude Sonnet, enough for "over 75,000 lines of code" in one request; Google's Gemini 2.5 Pro ships at one million tokens, with a two-million-token window signposted; and Meta's Llama 4 Scout claims an industry-leading ten million. At the other end, the widely deployed open-weight DeepSeek-V3 documents 128,000 tokens. Numbers move fast in this space, so treat any table as a snapshot and re-check the vendor page before you build on a specific figure.
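As a quick illustration of how these figures get used, the sketch below checks which of the listed models could hold a request of a given size. The window sizes are the advertised maximums from this article's snapshot; the function name and the `headroom` parameter are hypothetical, added only to show how a safety margin below the advertised ceiling might be applied.

```python
# Advertised maximum context windows (tokens), as listed in this article.
CONTEXT_WINDOWS = {
    "Llama 4 Scout": 10_000_000,
    "GPT-5.5": 1_050_000,
    "Gemini 2.5 Pro": 1_000_000,
    "Claude Sonnet 4.6": 1_000_000,
    "DeepSeek-V3": 128_000,
}


def models_that_fit(prompt_tokens: int, output_tokens: int = 0,
                    headroom: float = 0.0) -> list[str]:
    """Return models whose window can hold prompt plus output.

    headroom is the fraction of the advertised window to leave unused,
    e.g. 0.5 means only count on half of the stated maximum.
    """
    needed = prompt_tokens + output_tokens
    return [name for name, window in CONTEXT_WINDOWS.items()
            if needed <= window * (1 - headroom)]


print(models_that_fit(100_000, output_tokens=20_000))  # all five models
print(models_that_fit(120_000, headroom=0.5))          # excludes DeepSeek-V3
```

A check like this only answers whether the text fits. It says nothing about how well a model recalls material at that length.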

Does a bigger context window mean better answers?

No — and this is the most important caveat in the whole topic. A bigger window gives a model more headroom, but it does not guarantee the model will use that space well. The widely cited "Lost in the Middle" study from Stanford and collaborators found that models retrieve information placed near the start or end of a long context far more reliably than facts buried in the middle, producing a characteristic U-shaped accuracy curve even on models built for long context. Anthropic frames the broader pattern as "context rot": as the token count grows, recall and precision tend to degrade. Independent long-context benchmarks have shown some models dropping sharply well before their advertised maximum. The practical takeaway is that curation beats capacity — feeding a model the right few thousand tokens through retrieval usually outperforms dumping hundreds of thousands of tokens of raw text into the prompt.

How does retrieval relate to the context window?

Because windows are finite and accuracy fades as they fill, most production systems do not try to stuff everything into context. Instead they use retrieval-augmented generation (RAG): an external search step pulls only the most relevant passages from a larger knowledge base and places those inside the window, leaving the rest out. This keeps the prompt small, cheap, and on-topic, and it sidesteps the lost-in-the-middle problem by limiting what the model has to attend to. Large windows and retrieval are complements, not rivals — a generous window gives you room for the retrieved evidence plus a long conversation, while retrieval ensures that what occupies the window is signal rather than noise. The strongest 2026 architectures combine a capable long-context model with disciplined retrieval and clean, well-governed source data.
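The retrieval step can be sketched in miniature. The example below is a toy, not a production design: it scores passages by simple word overlap with the query, where real systems typically use embeddings or a search index, and it estimates tokens with the four-characters-per-token heuristic. All function names and sample passages are hypothetical.

```python
import math
import re


def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token for English."""
    return math.ceil(len(text) / 4)


def word_set(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, passages: list[str], token_budget: int) -> list[str]:
    """Pick the most relevant passages that fit within token_budget."""
    query_words = word_set(query)
    scored = []
    for index, passage in enumerate(passages):
        overlap = len(query_words & word_set(passage))
        if overlap > 0:  # passages sharing no words with the query stay out
            scored.append((overlap, index, passage))
    # Highest overlap first; ties keep their original order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    selected, used = [], 0
    for _, _, passage in scored:
        cost = estimate_tokens(passage)
        if used + cost <= token_budget:
            selected.append(passage)
            used += cost
    return selected


knowledge_base = [
    "The context window is measured in tokens.",
    "Bananas are rich in potassium.",
    "A token is about four characters of English text.",
]
for passage in retrieve("how many characters is a token", knowledge_base, 1000):
    print(passage)
```

Only the passages that share words with the question reach the prompt, best match first, and the irrelevant one never spends any of the window.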

The bottom line

The context window is a model's working memory, measured in tokens, and in 2026 the frontier sits around one million tokens for commercial models with open-weight outliers reaching far higher. But the advertised number is a ceiling, not a promise. What determines real-world quality is the usable window — how well a model maintains recall as the context fills — and the discipline you bring to what you put inside it. Understanding tokens, knowing each model's true limits, and pairing a sensible window with good retrieval will get you better, cheaper, and more reliable results than chasing the largest number on the spec sheet.

Frequently asked

What is an LLM context window?

An LLM context window is the maximum amount of text — measured in tokens — that a large language model can take in and keep "in view" while it generates a response. It is the model's working memory for a single request, covering the system instructions, your prompt, any documents or chat history you include, and the answer the model writes back. Crucially, it is not the same as the data the model was trained on; training knowledge is baked into the weights, while the context window is the live information you supply at run time. When a conversation or document exceeds the window, the oldest content must be dropped, summarized, or retrieved selectively, because anything outside the window is simply invisible to the model.

How big are context windows in 2026?

By mid-2026 the leading commercial models cluster around one million tokens. OpenAI's GPT-5.5 lists a 1,050,000-token context window with up to 128,000 tokens of output, and Google's Gemini 2.5 Pro and Anthropic's Claude Sonnet 4.6 both reach roughly one million tokens. Open-weight models vary widely: Meta's Llama 4 Scout claims an industry-leading ten million tokens, while DeepSeek-V3 ships at 128,000. For perspective, Google notes that one million tokens is roughly eight average-length novels or 50,000 lines of code. The headline number is a ceiling, not a guarantee — most models are most reliable well below their advertised maximum, so the practical window matters more than the marketing window.

What is the difference between a token and a word?

A token is the unit a model actually reads — a chunk of text produced by the tokenizer, usually a short word or a fragment of a longer one. In English, a common rule of thumb is that one token is about four characters, or roughly three-quarters of a word, so 1,000 tokens is approximately 750 words. Punctuation, spaces, numbers, and code each consume tokens too, and non-English text often tokenizes less efficiently. Because context windows are measured in tokens rather than words, a 200,000-token window does not hold exactly 200,000 words — it holds closer to 150,000 English words. When you estimate whether a document fits, convert to tokens rather than trusting a raw word count.

Does a bigger context window mean better answers?

Not automatically. A larger window lets a model consider more material at once, but research consistently shows that accuracy can fall as the window fills up. The influential "Lost in the Middle" study found that models retrieve information placed at the start or end of a long context far more reliably than information buried in the middle, producing a U-shaped accuracy curve. Anthropic describes the broader effect as "context rot": as token count grows, recall and precision degrade. In practice, feeding a model the right 2,000 tokens through retrieval often beats pasting 200,000 tokens of raw text. A big window is useful headroom, but curation of what goes into it usually matters more than sheer size.

What happens when you exceed the context window?

When your input plus the requested output would exceed the window, the model cannot read the overflow, so the excess has to be handled somewhere. Some APIs reject the request with an error; newer models may accept it and stop generation when the limit is hit, returning a "context window exceeded" signal. Applications usually avoid this by truncating old turns, summarizing earlier conversation, or using retrieval to pull in only the most relevant passages. Chat interfaces often run a rolling first-in, first-out window, so the earliest messages quietly fall out of memory. The practical lesson is to plan for the limit: count tokens before sending, and design your prompts and pipelines so the most important content stays inside the window.
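A rolling first-in, first-out window is simple to sketch. The version below is an assumption-laden illustration rather than how any particular chat product works: it uses the four-characters-per-token estimate instead of a real tokenizer, always keeps the system prompt, and drops the oldest turns until the rest fits. The function and parameter names are hypothetical.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token for English."""
    return math.ceil(len(text) / 4)


def trim_history(system_prompt: str, turns: list[str],
                 window: int, reserved_output: int) -> list[str]:
    """Keep the newest turns that fit; the oldest fall out first."""
    # Budget left for history after the system prompt and reserved output.
    budget = window - reserved_output - estimate_tokens(system_prompt)
    kept, used = [], 0
    for turn in reversed(turns):      # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break                     # this turn and everything older is dropped
        kept.append(turn)
        used += cost
    kept.reverse()                    # restore chronological order
    return kept


# Three 10-token turns, a 5-token system prompt, a 40-token window,
# and 10 tokens reserved for the reply leave room for two turns.
turns = ["a" * 40, "b" * 40, "c" * 40]
kept = trim_history("s" * 20, turns, window=40, reserved_output=10)
print(len(kept))  # 2
```

Dropping whole turns is the bluntest option. Summarizing the dropped turns, or retrieving from them on demand, preserves more of the conversation at the cost of extra processing.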

Which model has the largest context window?

As of mid-2026, Meta's open-weight Llama 4 Scout advertises the largest published window at ten million tokens, far ahead of the roughly one-million-token windows offered by GPT-5.5, Gemini 2.5 Pro, and Claude Sonnet 4.6. However, the largest advertised window is not the same as the most usable one. Independent long-context benchmarks have shown that some models degrade sharply well before their stated maximum, while others sustain high recall much deeper into the window. The right question is rarely "who has the biggest number," but "which model reliably uses the context length I actually need." For most real workloads, a smaller, well-utilized window paired with good retrieval outperforms a giant window stuffed with noise.