Agent context windows cannot hold even 1% of a real codebase
Real software projects have 50,000 to 1,000,000+ lines of code. The largest context windows (200K tokens for Claude, 128K for GPT-4) hold roughly 10,000 lines — less than 1% of a medium codebase.

So what? The agent must constantly decide what to read and what to ignore, and it frequently guesses wrong. It misses a type definition in another file, overlooks a test that would have revealed a pattern, or ignores a config that changes behavior. The result: agents confidently produce code that duplicates existing utilities, conflicts with established patterns, or breaks imports they never saw. Developers then spend more time understanding and fixing the agent's mistakes than they saved.

Why does this matter in the first place? Agents are net-negative on large codebases, which is exactly where they are needed most. Small codebases are manageable by humans. It is the 500K-line monolith with 10 years of history that desperately needs agent help — and that is precisely the codebase no agent can understand.

The structural reason: context windows grow linearly while codebases grow super-linearly. RAG retrieval helps for point lookups but has terrible recall for cross-cutting concerns like "how does authentication work across this entire app?" or "what are all the side effects of changing this type?". No retrieval system can substitute for actually reading and understanding the full codebase.
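The arithmetic behind the headline claim can be sketched in a few lines. The ~20 tokens-per-line figure below is an assumption chosen to match the post's "200K tokens ≈ 10K lines" estimate; real code averages anywhere from roughly 7 to 25 tokens per line depending on language and style.

```python
def context_coverage(codebase_lines: int,
                     context_tokens: int,
                     tokens_per_line: float = 20.0) -> float:
    """Fraction of a codebase visible in a single context window.

    tokens_per_line=20 is a rough assumption consistent with the
    post's estimate that a 200K-token window holds ~10K lines.
    """
    lines_in_window = context_tokens / tokens_per_line
    return min(1.0, lines_in_window / codebase_lines)

# A 200K-token window over a 1M-line codebase sees about 1% of it:
print(f"{context_coverage(1_000_000, 200_000):.1%}")   # 1.0%

# Even a "medium" 100K-line repo is mostly invisible:
print(f"{context_coverage(100_000, 200_000):.1%}")     # 10.0%
```

Note that the fraction only shrinks as the repo grows: doubling the codebase halves the coverage, while context windows have historically grown far more slowly than the projects they are pointed at.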
Evidence
Claude's 200K-token context ≈ 10K lines of code, while the average production repo runs 50K-500K lines. GitHub Copilot Workspace retrieves ~20 files max. SWE-bench results drop sharply as repository size increases: https://www.swebench.com/. Cursor indexes repos, but retrieval recall is poor for architectural questions that span many files.