Every agent interaction is a single synchronous session: you give it a task, it runs, it finishes, it is gone. Agents cannot monitor a system over days, follow up on a PR review cycle, track a customer onboarding process, or watch for regressions after a deploy. So what? The most valuable work in any organization is ongoing ownership, not one-shot tasks. A senior engineer's value is not in writing one function — it is in owning a system: watching for issues, responding to incidents, iterating based on feedback, maintaining quality over time. Agents cannot do any of this. They are expensive temps who disappear after each task. Why does this matter in the first place? One-shot tasks are the easy part of work. Monitoring production after a deploy, iterating on a PR based on reviewer feedback, gradually improving a system over weeks — this is where 80% of engineering time goes and 90% of the value is created. Agents are optimized for the 20% of work that was already the easiest part. The structural reason: agents run on ephemeral compute (API calls, serverless functions) with no persistent process. There is no standard way to define a long-running agent responsibility, trigger re-engagement based on external events (webhook, schedule, state change), or maintain continuity of context across multiple sessions over days or weeks. Building this requires solving agent memory, event-driven orchestration, and cost management simultaneously.
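To make the gap concrete, here is a minimal sketch of what a declarative "standing responsibility" might look like, assuming an orchestrator that can persist registrations for weeks and fire triggers. Every name in it (Trigger, Responsibility, memory_key, the handler) is hypothetical; no current framework ships this.

```python
# Hypothetical sketch of a long-running agent responsibility with event-driven
# re-engagement. The schema is invented for illustration, not taken from any
# existing framework.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trigger:
    kind: str          # "webhook" | "schedule" | "state_change"
    condition: str     # e.g. "pr.review_submitted" or "cron: 0 9 * * *"

@dataclass
class Responsibility:
    name: str
    goal: str                          # the ongoing outcome the agent owns
    triggers: list[Trigger]            # events that wake the agent back up
    memory_key: str                    # where prior-session context is persisted
    handler: Callable[[dict], None]    # what to do on each re-engagement

def handle_pr_event(event: dict) -> None:
    # In a real system this would load prior context from memory_key,
    # call the model, and write updated state back.
    print(f"re-engaging on {event['type']} for PR #{event['pr']}")

watch_pr = Responsibility(
    name="own-pr-123",
    goal="Shepherd PR #123 through review until it is merged",
    triggers=[
        Trigger("webhook", "pr.review_submitted"),
        Trigger("schedule", "cron: 0 9 * * *"),   # daily check-in
    ],
    memory_key="agent-memory/pr-123",
    handler=handle_pr_event,
)

# An orchestrator (not shown) would hold this registration over days or weeks
# and invoke the handler each time a trigger fires.
watch_pr.handler({"type": "pr.review_submitted", "pr": 123})
```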
Every agent task is described in natural language: "refactor this module," "fix the login bug," "research competitors." There are no formal acceptance criteria, no type-checked task definitions, no machine-verifiable success conditions. So what? The same instruction produces different results every run. "Fix the login bug" might mean fix the null pointer, fix the UI, fix the error message, or rewrite the whole auth flow depending on how the model interprets it. You cannot build reliable workflows on top of non-deterministic task interpretation. Every prompt is a prayer, not a specification. Why does this matter in the first place? Software engineering spent 50 years developing specification languages (types, schemas, contracts, test assertions) specifically because natural language is ambiguous and humans misinterpret each other constantly. We solved this problem for human-to-human communication in code. Now we are reintroducing the same ambiguity with human-to-agent communication and pretending natural language is fine. It is not fine — it is the same problem, and it needs the same class of solution: a formal way to define what "done" means that both humans and agents can agree on before execution starts. The structural reason: building a task specification language requires solving the hard problem of formally describing open-ended work, which is in tension with the appeal of agents ("just tell it what to do in English").
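A rough sketch of what such a specification layer could look like: the natural-language instruction stays, but "done" becomes a set of executable checks both sides agree on before execution starts. The TaskSpec and AcceptanceCriterion schema below is invented for illustration, not drawn from any existing tool.

```python
# Hypothetical task spec where success is defined by machine-checkable
# acceptance criteria rather than prose alone.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AcceptanceCriterion:
    description: str
    check: Callable[[], bool]   # executable predicate, not prose

@dataclass
class TaskSpec:
    instruction: str                        # the natural-language part stays
    criteria: list[AcceptanceCriterion]     # but "done" is formally defined

    def verify(self) -> bool:
        return all(c.check() for c in self.criteria)

def login_regression_suite_passes() -> bool:
    # Placeholder: in practice this would shell out to the project's test runner.
    return True

fix_login_bug = TaskSpec(
    instruction="Fix the login bug",
    criteria=[
        AcceptanceCriterion("login test suite passes", login_regression_suite_passes),
        AcceptanceCriterion("no new files outside auth/", lambda: True),  # placeholder
    ],
)

print("done?", fix_login_bug.verify())
```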
Real software projects have 50,000 to 1,000,000+ lines of code. The largest context windows (200K tokens for Claude, 128K for GPT-4) hold roughly 10,000 lines — about a fifth of even the smallest of those projects, and closer to 1% of the largest. So what? The agent must constantly decide what to read and what to ignore, and it frequently guesses wrong. It misses a type definition in another file, overlooks a test that would have revealed a pattern, or ignores a config that changes behavior. The result: agents confidently produce code that duplicates existing utilities, conflicts with established patterns, or breaks imports they never saw. Developers then spend more time understanding and fixing the agent's mistakes than they saved. Why does this matter in the first place? Agents are net-negative on large codebases, which is exactly where they are needed most. Small codebases are manageable by humans. It is the 500K-line monolith with 10 years of history that desperately needs agent help — and that is precisely the codebase no agent can understand. The structural reason: context windows grow linearly while codebases grow super-linearly. RAG helps for point lookups but has terrible recall for cross-cutting concerns like "how does authentication work across this entire app?" or "what are all the side effects of changing this type?" No retrieval system can substitute for actually reading and understanding the full codebase.
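The arithmetic behind that claim, under the rough assumption of about 20 tokens per line of code (the real figure varies by language):

```python
# Back-of-envelope arithmetic: how much of a codebase fits in one context window.
TOKENS_PER_LINE = 20           # assumption; varies by language and style
context_window = 200_000       # tokens

lines_in_context = context_window // TOKENS_PER_LINE   # ~10,000 lines at best

for codebase_lines in (50_000, 500_000, 1_000_000):
    share = lines_in_context / codebase_lines
    print(f"{codebase_lines:>9,} LOC codebase: {share:.1%} fits in one window")
```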
After an AI agent finishes a task, the only way to know if it did the right thing is for a human to manually review the output. The code might compile but have subtle logic errors. The research summary might sound authoritative but contain hallucinated facts. The workflow might complete 9 of 10 steps and silently skip the hardest one. So what? If every agent output requires human review, agents do not save time — they shift the work from doing to reviewing, which is often harder because you must verify work you did not produce. You need to understand the entire output deeply enough to catch errors, which requires nearly as much expertise and effort as doing it yourself. Why does this matter in the first place? The core promise of AI agents is autonomous execution — "do this task for me while I do something else." Without automated verification, this promise is a lie. You cannot run agents overnight. You cannot scale agents to 100 parallel tasks. You cannot build products that guarantee quality agent output. Every agent deployment has a human bottleneck that limits throughput to however fast humans can review. The structural reason: automated evaluation is a solved problem for narrow tasks with clear ground truth (unit tests, math problems), but most real agent tasks are open-ended ("refactor this module," "research competitors"). Defining machine-checkable success criteria for open-ended tasks is itself an unsolved AI problem.
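The asymmetry is easy to show in code: where ground truth exists, verification is a few lines; where it does not, there is nothing to write. Both helper names below are hypothetical.

```python
# Narrow tasks can be verified mechanically; open-ended ones are the unsolved part.
import subprocess
import sys

def verify_narrow_task(test_command: list[str]) -> bool:
    """Ground truth exists: run the tests and read the exit code."""
    result = subprocess.run(test_command, capture_output=True)
    return result.returncode == 0

def verify_open_ended_task(output: str) -> bool:
    """No ground truth: 'refactor this module' or 'research competitors'.
    Anything checkable mechanically is a weak proxy for what was meant."""
    raise NotImplementedError("defining machine-checkable success here is the open problem")

print(verify_narrow_task([sys.executable, "-c", "assert 1 + 1 == 2"]))
```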
Every time you start a new session with an AI agent, it starts from zero. Two hours of context-setting — explaining your project conventions, debugging approach, preferred tools, past decisions — is gone. So what? You pay the same onboarding cost every single session, forever. An agent that has helped you 100 times is exactly as helpful as one helping you for the first time. It will re-suggest solutions you already tried and rejected. It will re-discover patterns you already established. It will make the same mistakes you corrected yesterday. Why does this matter in the first place? The entire value proposition of a "teammate" vs a "tool" is that teammates learn. A junior developer on day 90 is vastly more useful than on day 1 because they have accumulated context about your codebase, preferences, and past decisions. Agents cannot do this. They are permanently stuck at day 1. This caps their value at "smart stranger who knows nothing about your project" no matter how long you use them. The structural reason: LLM context windows are ephemeral by design. Solutions like RAG over chat history retrieve raw conversation chunks, not distilled knowledge. Vector databases store embeddings but lack the structured reasoning to know when a past lesson applies to a current situation. Nobody has built the knowledge distillation layer that converts session transcripts into reusable agent memory.
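A toy sketch of that missing distillation layer, assuming the extraction step would really be an LLM call rather than the keyword filter used here; the Lesson schema is invented for illustration.

```python
# Hypothetical sketch: turn a raw session transcript into structured lessons
# that can be injected into future sessions instead of re-onboarding from zero.
from dataclasses import dataclass

@dataclass
class Lesson:
    topic: str        # e.g. "conventions", "tooling"
    statement: str    # a distilled, reusable fact or preference
    source: str       # which session it came from

def distill(transcript: list[str], session_id: str) -> list[Lesson]:
    # In practice this would be a model call that extracts durable preferences
    # and decisions; here it is faked with a keyword filter.
    lessons = []
    for line in transcript:
        if line.lower().startswith("remember:"):
            lessons.append(Lesson("conventions", line[len("remember:"):].strip(), session_id))
    return lessons

def to_system_prompt(lessons: list[Lesson]) -> str:
    # What would be prepended to the next session instead of two hours of context-setting.
    return "\n".join(f"- {l.statement} (learned in {l.source})" for l in lessons)

memory = distill(
    ["remember: we use pytest, not unittest",
     "remember: never touch the legacy billing module without a ticket",
     "ok, running the tests now..."],
    session_id="session-041",
)
print(to_system_prompt(memory))
```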
When an AI agent executes a multi-step task, you see what tools it called and what it produced, but not why it chose those actions, what alternatives it considered, or where it is in its overall plan. So what? When an agent fails (and they fail 30-60% of the time on complex tasks), you have no idea whether the failure was caused by a bad prompt, a bad tool, flawed reasoning, or bad luck. You cannot fix what you cannot diagnose. Debugging becomes trial-and-error: tweak the prompt, rerun, hope it works. This is not engineering — it is gambling. Why does this matter in the first place? Every mature engineering discipline has observability: software has debuggers and profilers, manufacturing has quality sensors, aviation has flight recorders. Agent development has none of this. We are building increasingly autonomous systems with less visibility into their behavior than we had with simple scripts. The structural reason this persists: LLM reasoning happens inside a forward pass that produces tokens. The model's internal deliberation is not exposed in the output unless you explicitly prompt for chain-of-thought, and even then you get a post-hoc rationalization, not the actual decision process. Tracing tools like LangSmith capture token-level I/O but cannot surface why the model chose action A over action B at a semantic level.
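For contrast, here is roughly what a decision-level trace record could capture. Whether the stated_reason field reflects the model's actual deliberation rather than a post-hoc rationalization is precisely the unsolved part; the structure is illustrative, not any existing tool's format.

```python
# Hypothetical decision-level trace, as opposed to token-level I/O logging.
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    step: int
    chosen_action: str
    alternatives: list[str]        # what else was on the table
    stated_reason: str             # best available: a prompted rationale
    plan_position: str             # where the agent thinks it is in its plan

@dataclass
class Trace:
    records: list[DecisionRecord] = field(default_factory=list)

    def log(self, record: DecisionRecord) -> None:
        self.records.append(record)

trace = Trace()
trace.log(DecisionRecord(
    step=1,
    chosen_action="read tests/test_login.py",
    alternatives=["grep for 'login'", "ask user for repro steps"],
    stated_reason="existing tests should show the expected auth flow",
    plan_position="step 1 of 4: locate the failing behaviour",
))
print(trace.records[0].chosen_action)
```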
When you give an AI agent access to a tool (shell, database, file system, API), it gets full unrestricted access. An agent with shell access can rm -rf / as easily as it can run a test. There is no equivalent of Linux capabilities, OAuth scopes, or IAM policies for agent tools. So what? Every enterprise CTO faces a binary choice: give the agent full access (unacceptable risk) or no access (useless agent). They choose no access. This is why agent adoption in enterprises is near zero for anything touching production systems. Why does this matter in the first place? The highest-value agent use cases — database migrations, infrastructure changes, deployment pipelines, customer data processing — are exactly the ones that require access to dangerous tools. The low-risk tasks (writing docs, answering questions) are low-value. Agents are stuck doing easy cheap work because they cannot be trusted with hard valuable work. The structural reason: LLMs are non-deterministic. You cannot prove in advance that an agent will not take a destructive action. Traditional software gets permissions because it is deterministic — you can audit the code. Agent behavior depends on prompt, context, and model weights, which makes static analysis impossible. Nobody has built the runtime permission enforcement layer that would make fine-grained agent permissions viable.
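A sketch of what a runtime permission layer might look like: scoped grants checked before every tool call, loosely analogous to OAuth scopes or IAM. The policy format is invented, and a glob allow-list like this is far weaker than what production enforcement would need.

```python
# Hypothetical runtime permission check wrapping an agent's shell tool.
from dataclasses import dataclass
import fnmatch

@dataclass
class ToolPolicy:
    allowed_commands: list[str]     # glob patterns over shell commands
    denied_paths: list[str]         # paths the agent may never touch

    def permits(self, command: str) -> bool:
        if any(path in command for path in self.denied_paths):
            return False
        return any(fnmatch.fnmatch(command, pattern) for pattern in self.allowed_commands)

def run_tool(command: str, policy: ToolPolicy) -> str:
    if not policy.permits(command):
        return f"DENIED: {command!r} is outside the granted scope"
    # Only now would the command actually be executed (omitted here).
    return f"would run: {command}"

policy = ToolPolicy(
    allowed_commands=["pytest*", "ls *", "git diff*"],
    denied_paths=["/etc", "/prod-db"],
)
print(run_tool("pytest tests/", policy))
print(run_tool("rm -rf /", policy))
```

Note that the destructive command is refused not because the layer understands it is destructive, but because it was never granted; default-deny is what makes scoping viable for non-deterministic callers.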
AI agents from different vendors (coding agents, browser agents, data agents, email agents) cannot delegate subtasks to each other. If a coding agent needs to look something up in a browser, or a research agent needs to write code to verify a finding, they cannot hand off work — each agent is an island. So what? This forces every agent to be a monolith that tries to do everything itself, poorly. A coding agent bolts on a bad browser. A research agent bolts on bad code execution. Instead of an ecosystem of best-in-class specialized agents composing like Unix pipes, we get bloated generalists that are mediocre at everything. Why does this matter in the first place? The entire history of software shows that composable specialized tools beat monoliths (Unix philosophy, microservices, npm packages). Agents are repeating the mainframe mistake — bundling everything into one giant system — because there is no pipe between them. The structural reason this persists: every framework (LangChain, CrewAI, AutoGen, Claude Agent SDK) invented its own message format, state representation, and tool schema before anyone thought about interop. Now we have N incompatible ecosystems and no incentive for any single vendor to adopt a competitor's protocol.
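What a vendor-neutral handoff could look like if such a pipe existed: the envelope below is invented for illustration, and its task field shows how the specification problem above would resurface inside any such protocol.

```python
# Hypothetical delegation envelope one agent could hand to another. No shared
# protocol like this exists across frameworks today, which is the point above.
import json
from dataclasses import dataclass, asdict

@dataclass
class Delegation:
    from_agent: str          # e.g. "coding-agent"
    to_capability: str       # e.g. "browser", "code-execution"
    task: str                # natural-language subtask
    inputs: dict             # structured arguments
    return_schema: dict      # what shape of result the caller expects

handoff = Delegation(
    from_agent="coding-agent",
    to_capability="browser",
    task="Find the current documented rate limit for the GitHub REST API",
    inputs={"urls_allowed": ["https://docs.github.com/*"]},
    return_schema={"type": "object", "properties": {"rate_limit": {"type": "string"}}},
)

# Any receiving agent that speaks the same envelope could pick this up.
print(json.dumps(asdict(handoff), indent=2))
```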