Is RAG dead? Context windows at 1M tokens
Read Time 7 mins | Written by: Cole
Every time a major context window announcement lands, the same question surfaces: is RAG dead?
It happened when Claude hit 100K. It happened again at 200K. Then Anthropic made 1 million tokens generally available for Opus 4.6 and Sonnet 4.6, and it came up again. Does RAG still matter with 1M context windows?
And the question is back, louder than ever, because the company that built the RAG category is now betting against it.
Pinecone built a popular vector database solution and helped define RAG as the standard pattern for grounding language models. Some 800,000 developers and 9,000 paying customers learned how to chunk, embed, and retrieve on Pinecone's infrastructure. With the launch of Nexus, a knowledge engine built for agents, Pinecone is telling those same developers that the pattern they learned is the bottleneck.
The answer to the RAG question is more complicated than a yes or no.
What 1M tokens actually means
A million tokens is roughly 750,000 words. That's your entire codebase, thousands of pages of contracts, months of customer support logs, or a full product documentation library – loaded into a single prompt, all visible to the model at once.
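To make the words-to-tokens ratio concrete, here is a minimal sketch that estimates whether a set of documents would fit in a 1M-token window. It assumes the rough 0.75 words-per-token ratio implied above and a whitespace word count. The function names and the `reserved_for_output` budget are illustrative assumptions, and a real tokenizer will give different counts, especially for code or non-English text.

```python
import math

WORDS_PER_TOKEN = 0.75             # implied by "1M tokens is roughly 750,000 words"
CONTEXT_WINDOW_TOKENS = 1_000_000  # the window size discussed in this article


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_window(documents: list[str], reserved_for_output: int = 50_000) -> bool:
    """True if the estimated input plus a reserved output budget fits the window.

    The 50,000-token default reserve is an arbitrary illustrative value.
    """
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserved_for_output <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    docs = ["word " * 300_000, "word " * 200_000]  # stand-ins for real files
    print(fits_in_window(docs))
```

An estimate like this is only good for a first pass. Before committing to a design, measure the real token count with the tokenizer or token-counting endpoint of the model you plan to use.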
Two things make Anthropic's 1M context window matter – and they apply to both Opus 4.7 and Sonnet 4.6:
- Pricing parity across the full window. Claude Opus 4.7 and Sonnet 4.6 include the full 1M context at standard rates – no long-context surcharge. A 900K-token request is billed at the same per-token rate as a 9K one. That removes the economic barrier that made previous extended windows technically available but impractical at scale.
- Retrieval accuracy that holds up. Opus 4.7 scores 78.3% on the MRCR v2 benchmark – roughly 3x higher than Gemini 3 Pro and over 4x higher than the previous best Claude model at the same context length. A large whiteboard you can't read reliably is worse than a small one you can.
Where this genuinely changes RAG use cases
For bounded document sets – a specific contract package, a codebase, a compliance library – teams can now skip complex chunking and retrieval pipelines and pass documents in full. RAG is no longer a default requirement. It becomes a deliberate architectural choice you make for specific reasons.
The clearest wins:
- Long-running AI agents. Before the 1M window, Claude Code had to compact context when sessions got long – details would vanish mid-task. Teams using Opus 4.7 are reporting a 15% decrease in compaction events, with agents able to hold full context and run for hours without losing track of earlier findings.
- Code review at scale. Agents that previously had to chunk large diffs – losing cross-file dependencies in the process – can now ingest the full diff in one pass and produce higher-quality output with simpler architecture.
- Document-heavy enterprise workflows. Legal, financial services, and compliance teams can now load an entire contract library or regulatory filing set into context and reason across all of it simultaneously.
A third pattern: knowledge compilation vs RAG
The context window vs. RAG debate has been a two-option framing. A third architecture is emerging – and it's what Pinecone is betting on with Nexus, and what Andrej Karpathy sketched out in his widely circulated LLM wiki.
The core idea: move the reasoning upstream to ingest time rather than query time.
Instead of handing raw chunks to a frontier model at query time, Nexus precompiles source data into typed, cited, task-specific artifacts. Agents query the artifacts, not the corpus. Pinecone argues this is why agents stuck in retrieve-read-retrieve loops finish only 50–60% of tasks, with 85% of agent effort going to fetching context – though those figures are Pinecone's own benchmarks and should be treated as directionally interesting until production teams validate them independently.
Karpathy's LLM Wiki pattern works the same logic from a different angle. Instead of retrieving from raw documents at query time, an LLM incrementally builds and maintains a persistent wiki – a structured, interlinked collection of markdown files that sits between you and the raw sources. When you add a new source, the LLM reads it, extracts the key information, and integrates it into the existing wiki: updating entity pages, revising topic summaries, noting contradictions, strengthening the evolving synthesis. The wiki is a persistent, compounding artifact.
The cross-references are already there. The synthesis already reflects everything you've fed it. At moderate scale, a simple index catalog avoids the need for embedding-based RAG infrastructure entirely.
The pattern isn't novel – Anthropic's compiled skills, Cursor rules, and Claude Code subagents all pre-package context and tools per task. What's notable is who is now saying it out loud at the infrastructure layer.
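A minimal sketch of the ingest-time idea follows. It is not Nexus's API or Karpathy's implementation. It is a toy under stated assumptions: the `Wiki` and `WikiPage` classes are hypothetical, and `extract` stands in for an LLM call that reads a source and returns notes keyed by page title. The point it illustrates is that synthesis happens when a source is added, and that queries then start from a plain index rather than an embedding search.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WikiPage:
    title: str
    body: str = ""
    sources: list[str] = field(default_factory=list)  # citations back to raw inputs


@dataclass
class Wiki:
    pages: dict[str, WikiPage] = field(default_factory=dict)

    def ingest(self, source_id: str, text: str,
               extract: Callable[[str], dict[str, str]]) -> None:
        """Fold one new source into the wiki at ingest time.

        `extract` is a hypothetical stand-in for an LLM call that returns
        {page_title: note} for the given source text.
        """
        for title, note in extract(text).items():
            page = self.pages.setdefault(title, WikiPage(title))
            page.body += f"- {note} [{source_id}]\n"
            if source_id not in page.sources:
                page.sources.append(source_id)

    def index(self) -> str:
        """Plain catalog of pages that an agent can read before opening any page."""
        ordered = sorted(self.pages.values(), key=lambda p: p.title)
        return "\n".join(f"- {p.title} (sources: {len(p.sources)})" for p in ordered)


def toy_extract(text: str) -> dict[str, str]:
    """Toy extractor for the demo: treats 'Topic: note' lines as page updates."""
    notes = {}
    for line in text.splitlines():
        if ":" in line:
            topic, note = line.split(":", 1)
            notes[topic.strip()] = note.strip()
    return notes


wiki = Wiki()
wiki.ingest("doc-1", "Billing: invoices are issued monthly", toy_extract)
wiki.ingest("doc-2", "Billing: refunds take 5 days\nAuth: tokens expire hourly", toy_extract)
print(wiki.index())
```

In a real system the extractor would also revise summaries and flag contradictions instead of only appending notes. The sketch keeps the append-only form so the data flow stays easy to follow.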
Where RAG still wins
MindStudio's analysis of the benchmark data makes the case plainly: 90% retrieval accuracy at 1M tokens is genuinely impressive. It also means roughly 1 in 10 queries gets the wrong answer – and that's before accounting for latency, cost per query, or the "lost in the middle" problem that affects all large language models at scale.
RAG isn't dead. Three scenarios where it remains the right call:
- Scale beyond the window. If your knowledge base runs to millions of documents, no context window solves that. You still need retrieval to select what goes into context before the model sees it. The 1M window changes what you put in – it doesn't eliminate the need to select.
- Cost at volume. A well-designed RAG pipeline retrieves 2,000–10,000 relevant tokens per query. Sending 100 queries a day at 1M tokens each costs roughly $15,000 a month at current Opus 4.6 pricing ($5 per million input tokens). Measured against a full 1M-token prompt, RAG represents a 100–500x reduction in input tokens per query. For enterprise applications handling thousands of daily queries, that math is decisive.
- Latency-sensitive user-facing applications. A 1M token context can push time-to-first-token past 20–30 seconds on standard API infrastructure – acceptable for a back-office batch job, a dealbreaker for a product feature where a user is waiting.
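The cost comparison above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the $5 per million input tokens figure comes from this article, a 30-day month is assumed, and output tokens and prompt caching are ignored. The function and variable names are illustrative.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 5.00  # USD, the Opus rate quoted in this article
DAYS_PER_MONTH = 30                    # assumption for a round monthly figure


def monthly_input_cost(tokens_per_query: int, queries_per_day: int) -> float:
    """Monthly input-token spend in USD; ignores output tokens and caching."""
    tokens_per_month = tokens_per_query * queries_per_day * DAYS_PER_MONTH
    return tokens_per_month / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


full_window = monthly_input_cost(1_000_000, 100)  # every query fills the window
rag_low = monthly_input_cost(2_000, 100)          # lean retrieval
rag_high = monthly_input_cost(10_000, 100)        # generous retrieval

print(f"Full 1M window: ${full_window:,.0f} per month")
print(f"RAG: ${rag_low:,.0f} to ${rag_high:,.0f} per month")
print(f"Reduction vs. a full window: {full_window / rag_high:.0f}x to {full_window / rag_low:.0f}x")
```

The full-window case is a worst-case bound. Most long-context requests will not use all 1M tokens, so the practical gap depends on how full the window actually is per query.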
The right mental model going forward
The "context window vs. RAG" framing is a false choice – and knowledge compilation is now a third option worth building around. The better question: what does each approach do well, and when do you need all three?
- Long context is now the right default for bounded, high-value workloads where accuracy matters and query volume is manageable – code review, document analysis, long-running agents.
- Knowledge compilation – whether through a tool like Nexus, a Karpathy-style LLM wiki, or compiled agent skills – makes the most sense for workloads where you're accumulating knowledge over time, where query patterns repeat, and where re-deriving synthesis on every call is untenable.
- RAG remains the right call for large-scale retrieval across massive corpora, high-volume query workloads where cost compounds, and user-facing features where latency is a product constraint.
The emerging pattern for sophisticated enterprise deployments is a hybrid across all three layers: RAG for broad retrieval, knowledge compilation for recurring synthesis, long context for deep analysis on whatever gets surfaced. Each layer doing what it does best.
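One way to picture how a team might pick a starting layer for a given workload is a small decision function. This is a sketch only: the `Workload` fields, the `choose_strategy` function, and every threshold below are illustrative assumptions, not recommendations from Anthropic or Pinecone. Real cutoffs should come from profiling your own token counts, costs, and latency.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    LONG_CONTEXT = "long_context"
    KNOWLEDGE_COMPILATION = "knowledge_compilation"
    RAG = "rag"


@dataclass
class Workload:
    corpus_tokens: int             # estimated size of the material to reason over
    queries_per_day: int
    latency_budget_seconds: float  # how long a caller can wait for a first token
    repeated_query_patterns: bool  # do the same kinds of questions recur?


# All thresholds are illustrative assumptions; tune them from real measurements.
WINDOW_TOKENS = 1_000_000
HIGH_VOLUME_QUERIES_PER_DAY = 1_000
INTERACTIVE_LATENCY_SECONDS = 5.0


def choose_strategy(w: Workload) -> Strategy:
    """Pick a starting layer; a hybrid system may still combine all three."""
    if w.corpus_tokens > WINDOW_TOKENS:
        return Strategy.RAG  # the corpus cannot fit, so something must select
    if w.latency_budget_seconds < INTERACTIVE_LATENCY_SECONDS:
        return Strategy.RAG  # a user is waiting, so keep prompts small
    if w.queries_per_day >= HIGH_VOLUME_QUERIES_PER_DAY:
        return Strategy.RAG  # per-query input cost compounds at volume
    if w.repeated_query_patterns:
        return Strategy.KNOWLEDGE_COMPILATION  # synthesize once, reuse often
    return Strategy.LONG_CONTEXT  # bounded, low-volume, accuracy-sensitive work


contract_review = Workload(400_000, 50, 60.0, False)
print(choose_strategy(contract_review).value)
```

The ordering of the checks encodes a priority: hard limits first (size, latency), then cost, then reuse. A different ordering is defensible if, for example, accuracy outranks cost for your use case.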
What this means for engineering leaders
The 1M context window eliminates a set of workarounds – rolling summarization, aggressive chunking, context compression logic – that teams have been building and maintaining for years because they had no other choice.
Anthropic's own documentation makes it clear that context discipline still matters. A 1M window is not an invitation to dump everything in and hope for the best. Loading irrelevant content wastes tokens and dilutes the signal the model uses to prioritize its attention.
The teams that will get the most out of this are the ones who treat RAG as an infrastructure choice: they think through the use case, profile the actual token requirements, and design around the new constraints rather than the old ones.
Codingscape builds production AI systems for enterprise engineering teams – from RAG pipelines to long-context agentic workflows. Talk to us about what the right architecture looks like for your use case.
Cole
Cole is Codingscape's Content Marketing Strategist & Copywriter.



