I sat in on a Buckeye AI session this week that covered reinforcement learning, evaluations, agent architectures, and the emerging practice of managing context windows. It was a fast-moving talk with a lot to digest, but what stood out most wasn’t just the technical discussion — it was realizing how many of the concepts I’ve been using for years in real AI-powered applications finally have shared names.

Having built AI-enabled systems both at Microsoft and previously as the technical founder of a healthcare technology startup, I’ve often implemented patterns long before the industry settled on formal terminology. One of my biggest takeaways was that the field is beginning to align on the language for these patterns. It reminded me of the moment — fifteen-plus years ago — when I first discovered the classic software architecture patterns book and thought: “So that’s what this thing I’ve been using is called.” Names give structure. They give us a way to talk about systems that already exist.

But the talk wasn’t just familiar ideas now wearing official labels. It also highlighted places where the presenter and I see the world differently, especially around agents, RAG, vibe coding, and the role of model development in the broader industry. Below are my reflections — the agreements, disagreements, and the open problems that still need serious thought.

The Overemphasis on Model Development

Before diving into the technical notes, I want to add a broader perspective that crystallized for me during the talk.

There is an unnatural amount of focus right now on training models, as if every company is supposed to build its own foundation model or maintain a custom model pipeline. To me, this is completely misaligned with how technology ecosystems actually work.

Yes, the models are sexy. The models are the technology that is going to enable this new wave. Yes, we need to understand the models. But it's just like when the internet was born: we didn't all decide to go build our own resilient distributed network of connected computers. We built websites. We built Amazon.com and eBay. Some frontier web companies succeeded, some did not.

The same happened during the mobile wave. After Steve Jobs mic-dropped the iPhone, we didn't all go out and start manufacturing our own version of it. Some did (here's looking at you, Samsung), but that was a small, intimate group competing over who would be the dominant player behind the de facto platform (or platforms). There could be more than one winner, just not dozens of winners. Apple and Android won; countless others, big and small, gave it a good go but ultimately failed.

The rest of the industry started rethinking how we built mobile software. We used devices like the iPhone, and platforms like the App Store, to go solve new problems.

AI is the same. There will be model developers, and there will be model distribution networks (the hyperscalers), but the lion's share of the work will be done by application developers who figure out how to build software in new ways.

We do not need:

  • Millions of models
  • Thousands of custom corporate models
  • Every enterprise spinning up its own model-training division

That is folly.

Most organizations shouldn’t — and won’t — train their own models any more than most companies build their own CPUs or manufacture their own cars.

I see the industry evolving toward a clear division:

  1. A small set of major players will train general-purpose foundation models. These are the auto manufacturers of AI. They build the engines.
  2. A cottage industry will emerge around specialty or small language models (SLMs). These are niche vehicles (race cars, utility trucks, custom mods) built for specific tasks.
  3. The rest of the industry will focus on using those models to build applications. These are the mechanics. They tune, customize, integrate, and build workflows around the engines — not reinvent the engine itself.

If your primary business isn’t “manufacturing cars,” then you buy the car and focus on using it well. The same is true for AI.

This is where the real workforce demand will be: engineers who know how to apply models, not build them.

The Question of Quality

The presentation touched on Reinforcement Learning (RL) techniques like GRPO and RLVR, but the heart of the discussion was about evaluation:

How do we verify that a model or an agent is producing accurate, reliable outcomes?

Accuracy is the lever that determines ROI. If you can’t trust the output, you’re forced to reinsert humans into the workflow, which defeats the purpose of automation.

This led me to ask two clarifying questions:

  1. Are these evals referring to human review of generated responses — like the feedback we give in a chat interface?
  2. Or programmatic pipelines that benchmark reasoning steps and tool interactions?

The distinction wasn’t fully clarified, but it matters. As systems shift from pure text generation toward action loops, evaluation must evolve to measure not just what the model said, but what it did.

I think chat-based systems already have a built-in mechanism for the first kind, but for AI-powered backends, particularly in the agentic space, there is tremendous room to grow.
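To make the second kind concrete, here is a minimal sketch of what a programmatic eval for an agentic backend might look like. Everything in it (the EvalCase fields, the shape of whatever run_agent returns) is a hypothetical placeholder, not anything the talk prescribed; the point is simply that the harness scores what the agent did (its tool trace) alongside what it said.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    expected_answer: str       # ground-truth text the reply should contain
    expected_tools: list[str]  # tools the agent should have invoked

def run_eval(run_agent, cases: list[EvalCase]) -> float:
    """Score what the agent *did* (its tool trace) as well as what it *said*."""
    passed = 0
    for case in cases:
        # run_agent is assumed to return an object with .answer (str)
        # and .tools_called (list[str]) -- a hypothetical interface.
        result = run_agent(case.prompt)
        said_ok = case.expected_answer.lower() in result.answer.lower()
        did_ok = all(t in result.tools_called for t in case.expected_tools)
        passed += said_ok and did_ok
    return passed / len(cases)
```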

Agents, RAG, and the Current Shift in Focus

A claim was made that agents are replacing RAG. This is one of the strongest areas where I disagree.

RAG and agents operate in different layers of the system:

  • RAG grounds the model in authoritative information.
  • Agents orchestrate actions, decisions, and tools.

One is about knowledge. The other is about behavior.

The existence of agents does not eliminate the need for retrieval-augmented grounding. If anything, agents often depend on accurate retrieval. What is happening, however, is that RAG is becoming less of a novelty — much like APIs became less exciting once everyone had one. Necessary? Yes. Headline-grabbing? Not anymore.

Agents are the current frontier — but they stand on top of retrieval, not in place of it.
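To make that layering concrete, here is a conceptual sketch, with retrieve, llm, and execute_tool as hypothetical stand-ins rather than any real framework: the retrieval call grounds the model in knowledge, and the loop around it orchestrates behavior.

```python
def answer(question: str, llm, retrieve, execute_tool, max_steps: int = 5) -> str:
    # RAG layer: ground the model in authoritative information first.
    context = [f"Source: {doc}" for doc in retrieve(question, top_k=5)]

    # Agent layer: orchestrate decisions and tools on top of that grounding.
    for _ in range(max_steps):
        step = llm(question=question, context=context)  # assumed to return a dict
        if step["type"] == "final_answer":
            return step["text"]
        # The model chose an action; execute it and feed the observation back.
        context.append(str(execute_tool(step["tool"], step["args"])))
    return "No answer within max_steps"
```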

Model Labs vs. Agent Labs

A useful distinction from the talk was between model labs (those who train models) and agent labs (those who build systems that use models).

Model labs focus on:

  • Training pipelines
  • Reinforcement learning
  • Weight architectures
  • Data curation

Agent labs focus on:

  • Tooling
  • Observability
  • Context engineering
  • Verification and recovery

This distinction mirrors my earlier perspective: not every company needs a model lab. But nearly every company building AI applications now needs some flavor of an agent lab.

Trinity’s decision to train their own model was mentioned, though not deeply explained. Was it latency? Domain specificity? Cost? Intellectual property? I left wanting more detail because these decisions carry heavy tradeoffs and are often misunderstood by businesses approaching AI for the first time.

Context Windows: The Most Underspecified Problem in AI Development

The portion of the talk that resonated most deeply with me was the discussion around context engineering. The suggested pattern was:

  1. Gather the right context
  2. Perform the action
  3. Validate the result
  4. Manage what stays in context

This matches my own experience: managing the context window may be the most important factor in producing accurate output from AI systems.
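As a rough illustration (not the presenter's implementation), the four steps might collapse into a single loop body like the sketch below, where every helper is a placeholder you would swap for your own retrieval, model call, validation, and pruning logic.

```python
def agentic_step(task: str, context: list[str],
                 gather, act, validate, prune) -> tuple[str, list[str]]:
    relevant = gather(task, context)     # 1. gather the right context
    result = act(task, relevant)         # 2. perform the action
    if not validate(task, result):       # 3. validate the result
        result = act(task, relevant)     #    e.g. retry once, or escalate to a human
    context = prune(context + [result])  # 4. manage what stays in context
    return result, context
```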

So I asked the obvious question:

“How are you managing the context window? Has anyone published definitive work on this?”

The answer seems to be: Not yet.

Every team is rolling its own approach:

  • Summaries
  • Memory stores
  • Sliding windows
  • Retrieval heuristics
  • Token management

The closest analogy I can find comes from deterministic software architecture. Managing context in LLMs feels like the nondeterministic equivalent of cohesion. You’re controlling how much of the system’s “mental workspace” is focused on a single, narrowly defined problem.
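For illustration, here is a minimal sketch of one roll-your-own heuristic: a token-budget sliding window that summarizes evicted turns instead of dropping them. The summarize callable and the word-count token estimate are stand-in assumptions, not a recommended implementation.

```python
def fit_to_budget(turns: list[str], budget: int, summarize) -> list[str]:
    def estimate_tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    kept: list[str] = []
    used = 0
    for turn in reversed(turns):  # keep the most recent turns verbatim
        if used + estimate_tokens(turn) > budget:
            break
        kept.insert(0, turn)
        used += estimate_tokens(turn)

    evicted = turns[: len(turns) - len(kept)]
    if evicted:
        # Compress older history into a single synthetic turn instead of dropping it.
        kept.insert(0, summarize(evicted))
    return kept
```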

This topic needs deep research and strong patterns. It might be an area I explore further.

On Vibe Coding and What It Means To Be a Developer

Another point of friction was around “vibe coding” — the use of AI to generate code that the developer might not fully understand.

The presenter argued:

“If you’re a developer, you shouldn’t vibe code.”

I understand the concern: developers shouldn’t surrender their agency or outsource their reasoning to the model. That part is absolutely correct.

But here’s where I part ways with the framing.

My definition of vibe coding is this:

Vibe coding is AI-assisted programming beyond the boundary of your own competency.

As a senior engineer, you frequently operate outside your comfort zone — new languages, unfamiliar frameworks, different platforms. You rely on foundational principles, not encyclopedic memorization.

In that setting, vibe coding is not a shortcut. It’s a reasonable and modern way to work. To me, AI behaves like an endless team of junior developers. They can draft code quickly, but I still own the architectural decisions, the reasoning, the debugging, and the final result.

Where the presenter and I absolutely agree is on the misuse of vibe coding: people who treat it as an “instant developer” button. The craft still matters. Reasoning still matters. Understanding still matters.

But used properly, vibe coding accelerates experienced engineers. It doesn’t replace them.

Where Our Perspectives Align

Despite the points of disagreement, the talk reinforced two things I think are universally true:

  1. Owning the context window is essential to reliable AI behavior.
  2. We need more rigorous frameworks for evaluating agentic systems.

The talk sparked thinking in areas that desperately need clarity, documentation, and agreed-upon patterns.

What I Took Away

A few clear themes emerged for me throughout the session:

  • The industry is converging on a shared vocabulary for concepts practitioners have used for years.
  • There are meaningful disagreements on agents, RAG, vibe coding, and what constitutes real engineering — but these disagreements sharpen our understanding.
  • The entire industry is overweighting model development, when the real explosion of opportunity lies in application development.
  • Evaluation and context window management remain the two most important unsolved practical challenges.
  • AI is becoming part of the software development craft — not a replacement for it.

Conclusion

The first Buckeye AI session offered a valuable snapshot of where the field is heading and where it's still unsettled. It underscored the difference between those who build models and those who build with models, and reminded me that the real wave of innovation will come from the latter. As patterns solidify, terminology matures, and tooling becomes more accessible, the next era of AI development will be shaped not by model training but by thoughtful application.

We don’t need millions of model developers. We need millions of AI application developers — engineers who know how to apply these tools to real problems.