# LLM Context Window Explained: Sizes by Model (2026)

> An LLM context window is the model's working memory — the maximum tokens it can read and generate at once. Here is what that means in 2026 and how context window sizes compare across the major models.

*Published 2026-06-14 · By Marcus Vance*

**In short**
An **LLM context window** is the maximum amount of text, measured in tokens, that a large language model can read and generate in a single request. It is the model's working memory — covering your prompt, any documents, the chat history, and the response — and anything outside it is invisible to the model.

If you have ever pasted a long document into a chatbot and watched it forget the opening by the time it reaches the end, you have run into the context window. In 2026 it is one of the most quoted specs on every model launch, and also one of the most misunderstood. Vendors race to advertise bigger numbers, but the size of the window and how well a model actually uses it are two different things. This explainer defines the term plainly, converts the jargon into words and pages, compares context window sizes by model, and explains why a bigger window is not automatically a better one.

## What is an LLM context window?

The context window is, in Anthropic's words, "all the text a language model can reference when generating a response, including the response itself" — a [working memory](https://platform.claude.com/docs/en/build-with-claude/context-windows) distinct from the vast corpus the model was trained on. Training knowledge is permanently encoded in the model's weights; the context window is the live information you hand it at run time. Everything you put in a single request shares that budget: the hidden system instructions, your question, any files or retrieved passages, the prior turns of the conversation, and the tokens the model spends writing its answer. When the total would exceed the window, something has to give — the oldest content gets dropped, summarized, or selectively retrieved, because the model literally cannot see anything beyond the window's edge.
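The "something has to give" step can be sketched in a few lines. The example below is a minimal illustration of the simplest policy, dropping the oldest turns first; the `Turn` class, the `trim_to_window` function, and the idea of reserving a fixed output budget are assumptions made for this sketch, not any vendor's API. Token counts are passed in, so any tokenizer can supply them.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str    # e.g. "user" or "assistant"
    text: str
    tokens: int  # token count from whatever tokenizer you use


def trim_to_window(system_tokens: int, turns: list[Turn],
                   window: int, reserve_for_output: int) -> list[Turn]:
    """Keep the newest turns that fit once the system prompt and the
    reserved output budget are subtracted from the window."""
    budget = window - system_tokens - reserve_for_output
    kept: list[Turn] = []
    used = 0
    # Walk from newest to oldest so the most recent turns survive.
    for turn in reversed(turns):
        if used + turn.tokens > budget:
            break
        kept.append(turn)
        used += turn.tokens
    kept.reverse()  # restore chronological order
    return kept


history = [
    Turn("user", "first question", 100),
    Turn("assistant", "first answer", 50),
    Turn("user", "follow-up", 25),
]
# Window of 200 tokens, 10 for the system prompt, 100 held back for the reply:
# only 90 tokens remain for history, so the oldest turn is dropped.
visible = trim_to_window(system_tokens=10, turns=history,
                         window=200, reserve_for_output=100)
print([t.text for t in visible])
```

Summarizing or retrieving older content instead of dropping it follows the same budget arithmetic; only the fallback for turns that do not fit changes.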

## How are context windows measured — tokens vs words?

Context windows are measured in **tokens**, not words. A token is the unit the tokenizer produces: usually a short word or a fragment of a longer one, with punctuation and runs of digits often split into tokens of their own. In many tokenizers the space before a word is folded into that word's token rather than counted separately. A useful rule of thumb for English is that one token is about four characters, or roughly three-quarters of a word, so 1,000 tokens is about 750 words and a one-million-token window holds on the order of 750,000 words. Google offers vivid anchors for that scale: one million tokens is roughly "8 average length English novels" or "50,000 lines of code," per its [long-context documentation](https://ai.google.dev/gemini-api/docs/long-context). Code, tables, and non-English text typically need more tokens than English prose for the same amount of content, so estimate in tokens rather than trusting a raw word count when you are deciding whether a document will fit.
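The rule of thumb translates directly into arithmetic. The helpers below are a rough sketch using only the two ratios quoted above (four characters per token, three-quarters of a word per token); the function names are made up for this example, and the results are estimates for English prose, not a substitute for running the model's actual tokenizer.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text


def tokens_from_chars(n_chars: int) -> int:
    """Estimate tokens from a character count, rounding up."""
    return math.ceil(n_chars / CHARS_PER_TOKEN)


def tokens_from_words(n_words: int) -> int:
    """Estimate tokens from a word count (one token is about 3/4 of a word)."""
    return math.ceil(n_words * 4 / 3)


def words_from_tokens(n_tokens: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return round(n_tokens * 3 / 4)


print(words_from_tokens(1_000))      # about 750 words
print(words_from_tokens(1_000_000))  # about 750,000 words
print(tokens_from_words(750))        # about 1,000 tokens
```

For anything close to a limit, count with the vendor's tokenizer or token-counting endpoint instead, since the ratio shifts for code and non-English text.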

## What are the context window sizes by model in 2026?

By mid-2026 the leading commercial models converge around the one-million-token mark, while open-weight models span a far wider range. The table below lists the maximum advertised context window for several widely used models, drawn from each vendor's own documentation.

**Context window sizes by model (maximum advertised, mid-2026)**

| Model | Developer | Context window (tokens) | Type |
| --- | --- | --- | --- |
| Llama 4 Scout | Meta | 10,000,000 | Open-weight |
| GPT-5.5 | OpenAI | 1,050,000 | Proprietary |
| Gemini 2.5 Pro | Google | 1,000,000 | Proprietary |
| Claude Sonnet 4.6 | Anthropic | 1,000,000 | Proprietary |
| DeepSeek-V3 | DeepSeek | 128,000 | Open-weight |

The specifics: OpenAI's GPT-5.5 model page lists a 1,050,000-token window with up to 128,000 output tokens; Anthropic announced a [one-million-token window](https://claude.com/blog/1m-context) for Claude Sonnet, enough for "over 75,000 lines of code" in one request; Google's [Gemini 2.5 Pro](https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/) ships at one million tokens with two million signposted; and Meta's [Llama 4 Scout](https://ai.meta.com/blog/llama-4-multimodal-intelligence/) claims an industry-leading ten million. At the other end, the widely deployed open-weight [DeepSeek-V3](https://arxiv.org/abs/2412.19437) documents 128,000 tokens. Numbers move fast in this space, so treat any table as a snapshot and re-check the vendor page before you build on a specific figure.
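As a quick way to use these figures, the sketch below checks which models could hold a given request. The numbers are copied from the table above and are a snapshot only; the `models_that_fit` helper is hypothetical, and it assumes the prompt and the generated output share a single window, as described earlier in this article. Some vendors also cap output tokens separately, which this check ignores.

```python
# Maximum advertised context windows from the table above (mid-2026 snapshot).
CONTEXT_WINDOWS = {
    "Llama 4 Scout": 10_000_000,
    "GPT-5.5": 1_050_000,
    "Gemini 2.5 Pro": 1_000_000,
    "Claude Sonnet 4.6": 1_000_000,
    "DeepSeek-V3": 128_000,
}


def models_that_fit(prompt_tokens: int, output_tokens: int) -> list[str]:
    """Return the models whose window can hold the prompt plus the reply."""
    needed = prompt_tokens + output_tokens
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


# A 120,000-token prompt with a 16,000-token reply needs 136,000 tokens,
# which is more than a 128,000-token window allows.
print(models_that_fit(prompt_tokens=120_000, output_tokens=16_000))
```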

## Does a bigger context window mean better answers?

No — and this is the most important caveat in the whole topic. A bigger window gives a model more headroom, but it does not guarantee the model will use that space well. The widely cited ["Lost in the Middle"](https://arxiv.org/abs/2307.03172) study from Stanford and collaborators found that models retrieve information placed near the start or end of a long context far more reliably than facts buried in the middle, producing a characteristic U-shaped accuracy curve even on models built for long context. Anthropic frames the broader pattern as "context rot": as the token count grows, recall and precision tend to degrade. Independent long-context benchmarks have shown some models dropping sharply well before their advertised maximum. The practical takeaway is that curation beats capacity — feeding a model the right few thousand tokens through retrieval usually outperforms dumping hundreds of thousands of tokens of raw text into the prompt.
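One mitigation that follows from the U-shaped curve is to place the most relevant passages at the start and end of the prompt and let the weakest ones fall in the middle. The sketch below shows that reordering only. It is an illustration of the idea, not a method taken from the paper or from any vendor, the function name is invented here, and whether it helps on a particular model is something to measure rather than assume.

```python
def reorder_for_edges(ranked: list[str]) -> list[str]:
    """Take passages ranked most-relevant-first and return an order in which
    the strongest passages sit at both ends and the weakest in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, passage in enumerate(ranked):
        # Alternate: ranks 1, 3, 5... fill from the front,
        # ranks 2, 4, 6... fill from the back.
        if i % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant, "E" the least
print(reorder_for_edges(ranked))    # ['A', 'C', 'E', 'D', 'B']
```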

## How does retrieval relate to the context window?

Because windows are finite and accuracy fades as they fill, most production systems do not try to stuff everything into context. Instead they use **retrieval-augmented generation (RAG)**: an external search step pulls only the most relevant passages from a larger knowledge base and places those inside the window, leaving the rest out. This keeps the prompt small, cheap, and on-topic, and it sidesteps the lost-in-the-middle problem by limiting what the model has to attend to. Large windows and retrieval are complements, not rivals — a generous window gives you room for the retrieved evidence plus a long conversation, while retrieval ensures that what occupies the window is signal rather than noise. The strongest 2026 architectures combine a capable long-context model with disciplined retrieval and clean, well-governed source data.
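The retrieve-then-prompt flow can be sketched without any external service. In the example below, word overlap stands in for a real retriever (production systems typically use embeddings or a search index), token cost uses the four-characters-per-token rule of thumb, and every name (`score`, `retrieve`, `build_prompt`) is made up for illustration.

```python
def score(query: str, passage: str) -> int:
    """Toy relevance score: number of lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query: str, passages: list[str], token_budget: int) -> list[str]:
    """Pick the highest-scoring passages that fit within the token budget."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    chosen: list[str] = []
    used = 0
    for passage in ranked:
        if score(query, passage) == 0:
            break  # everything after this point is irrelevant
        cost = (len(passage) + 3) // 4  # rough estimate: 4 characters per token
        if used + cost > token_budget:
            continue  # too big; a smaller passage may still fit
        chosen.append(passage)
        used += cost
    return chosen


def build_prompt(query: str, passages: list[str], token_budget: int) -> str:
    """Place only the retrieved passages in the prompt, followed by the question."""
    context = "\n\n".join(retrieve(query, passages, token_budget))
    return f"Context:\n{context}\n\nQuestion: {query}"


knowledge_base = [
    "The context window is measured in tokens.",
    "Bananas are rich in potassium.",
    "A token is roughly four characters of English text.",
]
print(build_prompt("how many characters is a token", knowledge_base, token_budget=1000))
```

The off-topic passage never reaches the prompt, which is the point: the budget is spent on text that bears on the question.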

## The bottom line

The context window is a model's working memory, measured in tokens, and in 2026 the frontier sits around one million tokens for commercial models with open-weight outliers reaching far higher. But the advertised number is a ceiling, not a promise. What determines real-world quality is the *usable* window — how well a model maintains recall as the context fills — and the discipline you bring to what you put inside it. Understanding tokens, knowing each model's true limits, and pairing a sensible window with good retrieval will get you better, cheaper, and more reliable results than chasing the largest number on the spec sheet.

## Sources

1. [Context windows](https://platform.claude.com/docs/en/build-with-claude/context-windows)
2. [Claude Sonnet 4 now supports 1M tokens of context](https://claude.com/blog/1m-context)
3. [GPT-5.5 Model](https://developers.openai.com/api/docs/models/gpt-5.5)
4. [Long context](https://ai.google.dev/gemini-api/docs/long-context)
5. [Gemini 2.5: Our newest Gemini model with thinking](https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/)
6. [The Llama 4 herd: A new era of natively multimodal AI](https://ai.meta.com/blog/llama-4-multimodal-intelligence/)
7. [DeepSeek-V3 Technical Report](https://arxiv.org/abs/2412.19437)
8. [Lost in the Middle: How Language Models Use Long Contexts](https://arxiv.org/abs/2307.03172)

---
Source: https://aiintelreport.com/frontier-models/llm-context-window
