Thoughts on Agents of Code
When we ask what an AI agent is today, we're really asking about something that's still taking shape before our eyes. Like trying to describe a mountain while climbing it, our perspective keeps shifting. But if we pause to look around, we can see three distinct horizons: where we are now, what is emerging, and where we might be headed.
Today, AI agents are essentially programs that follow instructions to complete isolated tasks. But they're rapidly evolving into something more substantial: systems that can understand larger contexts, coordinate multiple steps, and handle increasingly complex assignments with less human guidance.
I should note upfront that my knowledge in this area comes mostly from studying coding agents, which may or may not be representative of all AI agent development. It's also important to distinguish between what we know from established benchmarks versus what I'm inferring based on patterns, and I'll try to make these distinctions clear throughout.
Horizon 1: Where We Are
The current landscape of AI agents is dominated by what we might call the "ReAct paradigm" – systems built around reasoning and acting in cycles. This architecture has certain merits: it's intuitive, maps well to how we think humans solve problems, and we have concrete evidence of its effectiveness, albeit with significant limitations.
In coding specifically, ReAct-based agents follow a predictable pattern. They analyze the problem, search for relevant files, read portions of code, make edits, and verify results – all in discrete steps with the AI model guiding each transition.
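To make that pattern concrete, here is a minimal sketch of such a loop. It is purely illustrative: call_model and the tool implementations are hypothetical placeholders, not any particular product's code.

```python
# Minimal ReAct-style loop (illustrative sketch, not a real product's code).
# `call_model` and the entries in `tools` are hypothetical placeholders.

def react_loop(task: str, tools: dict, call_model, max_steps: int = 15) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        # The model reasons over the full transcript and proposes one action.
        decision = call_model(
            "You are a coding agent. Think, then pick exactly one tool.\n"
            + "\n".join(transcript)
        )
        if decision["action"] == "finish":
            return decision["answer"]
        # Execute the chosen tool (e.g. search_files, view_file, edit_file).
        observation = tools[decision["action"]](**decision["args"])
        transcript.append(f"Action: {decision['action']}({decision['args']})")
        transcript.append(f"Observation: {observation}")
    return "Gave up: step budget exhausted."
```

The key property is that the model guides every transition, and the transcript of actions and observations grows with each cycle.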
The "ReAct-based agent architecture" exemplifies this approach. Based on measurements that are publicly available, it achieved 18.00% success on the SWE Bench Lite benchmark using GPT-4 Turbo1. Its components include file search tools that summarize each file in one line, file viewers with 100-line viewports (which ablation studies showed to be optimal), and editors for code modification.
But these systems face limitations that become clearer the more we use them:
First, they struggle with context management. After about 15 interactions, performance measurably degrades. This is verifiable data, not speculation. Even models advertising 200,000-token context windows show diminishing comprehension as context grows. The practical limit appears much lower than advertised maximums.
Second, they have difficulty maintaining consistent focus over long sequences. A single misstep can derail the entire process. In benchmarks with multiple steps, performance decreases with each additional required action.
Third, they're often inefficient. Each reasoning and acting cycle requires a new model call, increasing costs and latency. How much overhead this creates varies by implementation, but in many cases, more resources go into the scaffolding than the actual problem-solving.
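The overhead is easy to see with a back-of-the-envelope calculation: because each cycle re-sends the growing transcript, total prompt tokens grow roughly quadratically with the number of steps. The figures below are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope: cumulative prompt tokens for an N-step ReAct loop.
# Assumes a ~2,000-token base prompt and ~700 tokens of action + observation
# appended per step; both numbers are made up for illustration.

BASE_PROMPT = 2_000
TOKENS_PER_STEP = 700

def total_prompt_tokens(steps: int) -> int:
    # Step k re-sends the base prompt plus everything accumulated so far.
    return sum(BASE_PROMPT + k * TOKENS_PER_STEP for k in range(steps))

for steps in (5, 10, 15):
    print(steps, "steps ->", total_prompt_tokens(steps), "prompt tokens")
# 5 steps -> 17,000 | 10 steps -> 51,500 | 15 steps -> 103,500
```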
Despite these limitations, first-horizon agents have shown measurable capabilities. Various benchmarks show they can handle some coding tasks successfully: Claude 3.5 Sonnet (v20241022) achieves a 38.33% resolution rate on SWE-bench Lite with execution feedback [2], GPT-4.5 reaches 38.0% on SWE-bench Verified [3], and Lingma SWE-GPT 72B achieves 30.20% on SWE-bench Verified [4]. However, there are concerns about benchmark reliability: a recent study found that filtering out problematic issues can cut resolution rates sharply, with SWE-Agent + GPT-4 falling from 12.47% to 3.97% [5].
Horizon 2: The Emerging Alternatives
While first-horizon agents continue to improve incrementally, we're seeing approaches that challenge core assumptions about how AI agents should function. I want to be clear about what's demonstrated versus what's promising but less certain.
The "agentless" architecture has shown concrete results. Rather than implementing an explicit feedback loop with discrete steps, this approach makes a single augmented model call, providing all necessary context upfront. Benchmark results show it achieved 33.2% success on SWE-Bench6, outperforming ReAct's 18% with significantly less code. Even with retry mechanisms, it only gained 1-2 additional percentage points.
Anthropic's "string replace" technique represents another approach with verified results. By fine-tuning models for string replacement operations and using minimal scaffolding (around 100 lines of Python), they achieved 49% success on SWE-Bench7. This approach shifts intelligence from the scaffold into the model itself, though it creates longer execution paths with less predictable costs.
We're also seeing multi-agent architectures that divide tasks among specialized components. Some implementations, like CodeR, mimic corporate hierarchies with manager agents, product managers, editors, and QA engineers. Others split reasoning and editing between different models. What's less clear is how much better these approaches perform compared to simpler architectures; the data here is more preliminary.
Among the tensions becoming visible in this horizon:
Specialization vs. generality: There are measurable performance gaps between seemingly similar-class models – like Claude Sonnet reportedly outperforming GPT-4o by ~5.8 points on graduate-level reasoning [8]. This suggests specialization may offer advantages, though the full picture is more complicated.
Tool complexity: Current evidence suggests models struggle with numerous complex tools, particularly those with many flags and options. But determining the optimal level of abstraction remains an open question.
Context management strategies: In controlled tests, performance degrades after 2,000 tokens and worsens after 32K tokens [9]. This is factual information, not conjecture. But the optimal strategy for working within these constraints isn't yet established (one simple mitigation is sketched after this list).
Human involvement: This is perhaps the least studied aspect empirically. While approaches like Replit Agent propose specific human-in-the-loop strategies, I haven't seen rigorous comparisons of different intervention models.
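The mitigation I alluded to above is straightforward: keep the task description plus the most recent exchanges and drop (or, in better systems, summarize) the middle of the transcript. The sketch below assumes a crude four-characters-per-token estimate rather than a real tokenizer.

```python
# Crude context-budget trimming: keep the task plus the newest messages that
# fit. The 4-chars-per-token estimate is a rough assumption; real systems use
# a tokenizer and often summarize dropped messages instead of discarding them.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_history(task: str, messages: list[str], budget: int = 32_000) -> list[str]:
    kept, used = [], estimate_tokens(task)
    for msg in reversed(messages):      # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [task] + list(reversed(kept))
```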
These questions don't have universal answers, and the data we have is still emerging. Different contexts appear to favor different approaches, but the field lacks consensus on many fundamental questions.
Horizon 3: The Potential Transformation
Now I'm going to shift into more speculative territory. Beyond current implementations and emerging innovations, we can imagine a third horizon – one that might transform software development practices. I want to be clear that this section contains more conjecture than established fact.
One possibility is the democratization of software creation – a world where non-engineers can create functional software through conversation. Some experimental approaches like Replit Agent try to address this by helping users break tasks into subtasks and creating a dialogue between intent and implementation. But whether this approach can scale to complex applications remains an open question.
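Purely as an assumption about how such a flow might look (not Replit Agent's actual design), one can picture a loop that asks the model for subtasks and confirms each with the user before implementing it:

```python
# Hypothetical intent-to-implementation dialogue; every callable here is a
# placeholder supplied by the caller, not a real product API.

def conversational_build(request: str, propose_subtasks, ask_user, implement):
    plan = []
    for subtask in propose_subtasks(request):       # model-proposed breakdown
        answer = ask_user(f"Planned step: {subtask}. Proceed, skip, or rewrite it?")
        if answer == "skip":
            continue
        plan.append(subtask if answer == "proceed" else answer)
    return [implement(step) for step in plan]        # build only confirmed steps
```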
This evolution follows a traceable trajectory in coding model benchmarks. We've seen rapid progress from the basic HumanEval benchmark (now saturated at approximately 95% for frontier models) to more complex benchmarks like SWE-bench. But benchmark performance doesn't always translate to real-world capability, and we should be cautious about extrapolating too far.
Several challenges appear likely, based on current patterns:
First, there often exists a mismatch between what users request and what they actually need. This is a well-documented problem in software engineering, not specific to AI. But in an agent-first world, the human interpreters who traditionally bridge this gap might be missing. How significant this problem will be remains uncertain.
Second, reliability at scale is still unproven. Current systems show performance degradation as task complexity increases, with success rates approaching zero as step count rises (a back-of-the-envelope calculation after this list shows why). Whether this is a fundamental limitation or a temporary obstacle isn't yet clear.
Third, the economics remain unsettled. Current agent development requires significant investment in prompt engineering, often specific to particular models. Switching costs between vendors appear high, though this is based more on industry reports than rigorous studies. Most production agents are reportedly limited to processing about 100K tokens of code, but this limitation could change.
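The reliability concern has a simple arithmetic core: if each step succeeds independently with probability p, a task requiring n steps succeeds with probability p^n, which collapses quickly. The per-step rates below are illustrative, not measured.

```python
# Compounding errors: P(task succeeds) = p ** n for n independent steps.
# The per-step success rates are illustrative assumptions.

for p in (0.99, 0.95, 0.90):
    for n in (10, 25, 50):
        print(f"p={p:.2f}, n={n:2d} -> {p ** n:.1%}")
# e.g. a 95%-reliable step compounded over 50 steps succeeds only ~7.7% of the time.
```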
What's especially uncertain is how quickly these challenges might be overcome, and whether progress will be steady or come in breakthrough moments.
Building Bridges
What's particularly interesting is that all three horizons exist simultaneously. We have evidence-based understanding of the first horizon, emerging data about the second, and informed speculation about the third.
The evolution of coding models illustrates this progression. We've witnessed a documented shift from autoregressive code generation to instruction-following models, then to repo-level understanding and tool usage. This trajectory suggests coding capabilities often precede developments in other AI domains, though this pattern might not continue indefinitely.
For those working on first-horizon solutions, the challenge is optimization within known constraints. Techniques like symbol indexing, grep/ripgrep for file scanning, and vector search show promise for addressing specific limitations like file localization.
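As a concrete example of the grep/ripgrep route, the sketch below shells out to ripgrep to rank files by how often they mention identifier-like tokens drawn from an issue. It assumes rg is installed and on PATH, and the token extraction is a deliberately naive placeholder.

```python
import collections
import re
import subprocess

# Naive file localization with ripgrep: score files by how many identifier-like
# tokens from the issue text they contain. Assumes `rg` is available on PATH.

def localize_files(issue_text: str, repo_path: str, top_k: int = 5) -> list[str]:
    symbols = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]{3,}", issue_text))
    scores = collections.Counter()
    for sym in symbols:
        result = subprocess.run(
            ["rg", "--count-matches", "--fixed-strings", sym, repo_path],
            capture_output=True, text=True,
        )
        for line in result.stdout.splitlines():   # each line: path:match_count
            path, _, count = line.rpartition(":")
            if path:
                scores[path] += int(count)
    return [path for path, _ in scores.most_common(top_k)]
```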
For those exploring the second horizon, the question is which emerging patterns will prove most valuable. The industry appears to be moving toward minimizing custom code around models as capabilities improve, but whether this trend continues depends on many factors.
And for those considering the third horizon, the task is identifying which technologies might facilitate transformation. Industry standardization around formats like diffs could enable broader adoption, but predicting which standards will emerge is difficult.
What Lies Ahead?
Where does this leave us? Here are several patterns that seem likely to shape AI agents in the coming years, based on current evidence and reasonable extrapolation:
Intelligence will likely shift toward models. As models become more capable, agent frameworks appear to be getting simpler: small sets of specific tools with clear interfaces are becoming standard, along with a trend toward letting the model decide action sequences.
Tool interfaces will probably converge on conventions that balance human readability with model comprehension. The industry is exploring various formats, but it's too early to predict which will become standard.
Human-in-the-loop approaches will likely persist, even with increasing automation. The "compounding errors" problem makes pure automation risky for complex tasks, but the optimal balance remains unknown.
Cost-performance tradeoffs will continue driving decisions. We know that test-time scaling (multiple attempts) improves performance but increases costs; a best-of-n sketch after these points illustrates the tradeoff. This tension will likely shape architecture choices, though how dramatically costs will fall is uncertain.
Domain-specific optimizations appear to offer advantages over general-purpose solutions, especially in domains like software development. But determining the right level of specialization remains challenging.
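Here is the test-time-scaling sketch promised above: sample several candidate patches and keep the first one that passes a verifier (such as the repo's test suite), paying roughly n times the generation cost in the worst case. The helper functions are hypothetical.

```python
# Best-of-n sampling with a verifier (illustrative sketch).
# `generate_patch` and `run_tests` are caller-supplied hypothetical helpers.

def best_of_n(issue: str, generate_patch, run_tests, n: int = 5):
    total_cost = 0.0
    for _ in range(n):
        patch, cost = generate_patch(issue, temperature=0.8)  # independent samples
        total_cost += cost
        if run_tests(patch):           # verifier, e.g. the repository's test suite
            return patch, total_cost   # success rate rises with n, and so does cost
    return None, total_cost
```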
It's important to emphasize that these are informed projections, not certainties. The field is evolving rapidly, and today's conventional wisdom could be tomorrow's discarded hypothesis.
What seems most clear is that the agents we'll see five years from now will differ considerably from today's versions. They'll likely be more reliable and more integrated into workflows. But their most transformative effects will probably come not just from enhancing existing practices but from enabling new approaches to creation and problem-solving that today we can barely imagine.
I don't know exactly what those transformations will look like – no one does. But by examining the horizons that are visible today, we can perhaps better prepare for the ones that will emerge tomorrow.
https://openai.com/index/introducing-swe-bench-verified/
https://openreview.net/pdf/921e90476934d701bd24e8e53e66470f5e361548.pdf
https://www.helicone.ai/blog/gpt-4.5-benchmarks
https://github.com/LingmaTongyi/Lingma-SWE-GPT
https://openreview.net/pdf?id=pwIGnH2LHJ
https://openai.com/index/introducing-swe-bench-verified/
https://aiheroes.io/claude-3-5-sonnet-overtakes-gpt-4o/
https://guptadeepak.com/complete-guide-to-ai-tokens-understanding-optimization-and-cost-management/

