
How I use AI agents to build software

Posted on Sunday, March 1st, 2026 by Jeroen Derks

AI as a development tool, not a buzzword

Most discussions about AI (Artificial Intelligence) in software development fall into one of two camps: breathless enthusiasm or existential dread. This article is neither. I work as a freelance developer across PHP, Laravel, mobile (Flutter/Dart), frontend frameworks, C programs, and more. Over the past year I have integrated AI agents into my daily workflow, and I want to describe concretely what that looks like — what works, what does not, and why.

The most important constraint shaping this workflow is one that rarely gets mentioned in the hype: AI agents have limited context windows. They cannot hold an entire codebase in their working memory at once. This forces a discipline that turns out to be good engineering practice regardless of AI: break features into smaller, well-defined chunks; plan before executing; and review iteratively rather than in one large pass.

What follows covers the specific workflow I use, the infrastructure it runs on, the lessons I have learned from applying it across multiple technology stacks, and the failure modes you need to be aware of.

Claude Code and Codex

My primary planning and implementation agent is Claude Code from Anthropic. It designs architecture, writes execution plans, generates code, and works through tasks systematically. It handles the constructive part: given a clear problem statement and relevant context, it produces a plan and then implements it.

My review agent is Codex from OpenAI. Its role is adversarial in the productive sense: it receives the plan Claude produced and looks for gaps, flawed assumptions, missing edge cases, and better approaches. The two models come from different organisations, trained on different data, with different tendencies and different blind spots. That difference is the point — one catches what the other misses.

Manually reviewing every line of AI-generated code would negate most of the efficiency gain. So the review itself is delegated to an agent. My role is to intervene when the agents get stuck, when the output drifts from the actual requirement, or when domain knowledge is needed that neither model has. The final decision on what ships is always mine.

Plan, review, iterate, execute

The concrete workflow for any feature or task runs as follows. First, I give Claude Code a clear description of the task along with relevant context: which files are involved, what patterns the existing codebase follows, what the expected outcome should be. Claude produces an execution plan covering the architecture, the specific file changes, and the approach it intends to take.

That plan goes to Codex for review. Codex identifies problems: missing error handling, incorrect assumptions about the existing code, overly complex abstractions, security oversights, or cases the plan does not account for. Codex's feedback goes back to Claude, which revises the plan. This loop typically runs two or three times before both agents converge on something solid. If the plan contains distinct phases or sub-tasks, each phase goes through the same review cycle before the next one begins.

Only after the plan is stable does implementation start. After implementation, commits are also reviewed — depending on the size and complexity of the change, another agent pass catches issues that slipped through the planning phase. The key insight is that iterating on a plan is far cheaper than iterating on written code. Most problems surface before a single line is implemented.
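The shape of that loop can be sketched in a few lines of Python. The two functions below are placeholders standing in for CLI invocations of the planning and review agents; `MAX_ROUNDS` and the empty-feedback convergence check are assumptions of the sketch, not part of either tool:

```python
# Sketch of the plan-review loop. The two agent functions are placeholders
# for CLI calls to Claude Code and Codex; MAX_ROUNDS and the empty-feedback
# convergence check are illustrative assumptions.

MAX_ROUNDS = 3  # the loop typically converges in two or three passes

def generate_plan(task: str, feedback: list[str]) -> str:
    """Placeholder for the planning agent (Claude Code in my setup)."""
    return f"plan for {task!r} addressing {len(feedback)} review notes"

def review_plan(plan: str) -> list[str]:
    """Placeholder for the review agent (Codex). Empty list means approved."""
    return []

def converge(task: str) -> str:
    feedback: list[str] = []
    for _ in range(MAX_ROUNDS):
        plan = generate_plan(task, feedback)
        feedback = review_plan(plan)
        if not feedback:          # reviewer raised no remaining issues
            return plan           # plan is stable; implementation can start
    raise RuntimeError("no convergence; time for a human to step in")
```

The `RuntimeError` branch is the important structural detail: when the agents do not converge within a few rounds, the loop stops and escalates to the developer instead of burning tokens indefinitely.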

Docker, VMs, and NFS

AI agents do not just generate text — they execute code, install dependencies, run tests, and interact with the filesystem. Running that inside your host machine is a mistake you make once. My setup uses Docker containers running inside a virtual machine. Project files are exposed to the containers via an NFS mount. The agents work on the real codebase without having any direct access to the host.

This layered isolation matters in practice. If an agent installs an unexpected package, runs a command that has side effects, or produces code that fails loudly at runtime, the blast radius is contained within the container. The VM boundary is a second containment layer. Containers can be rebuilt cleanly between sessions, so there is no accumulated state from one task contaminating the next.

With multiple SSH sessions into the VM, it is straightforward to run several agents in parallel — different agents on different tasks, or on different projects simultaneously. The NFS setup means each container sees the same files without duplication or synchronisation overhead.
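For a concrete picture of the container side, a Docker Compose file can declare the NFS-backed project volume directly. Every name, address, and path below is a placeholder; this is a sketch of the shape of the setup, not a literal configuration:

```yaml
# Illustrative only: image name, NFS server address, and export path
# are placeholders. Runs inside the VM; the NFS export comes from the
# machine that holds the project files.
services:
  agent:
    image: agent-sandbox:latest   # project-specific image with the toolchain
    working_dir: /workspace
    volumes:
      - projects:/workspace

volumes:
  projects:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=192.168.56.1,rw,nfsvers=4"
      device: ":/srv/projects"
```

Rebuilding the `agent` service between sessions gives the clean-slate property described above: the code persists on the NFS export, while everything the agent installed or changed inside the container is discarded.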

One workflow, many stacks

The range of projects I apply this workflow to is wider than you might expect: websites, Flutter/Dart mobile applications, C programs, Laravel backends, React and Vue frontends, study material preparation for working efficiently through large bodies of content, and text writing. The stacks are different in almost every relevant way — language, runtime, build tooling, testing approach.

The plan-review-iterate workflow is stack-agnostic. The discipline is in the process, not the tooling. What matters is giving the agent clear context about what already exists, what the task requires, and what constraints apply. AI agents handle context-switching between stacks better than I initially expected, provided that context is supplied explicitly rather than assumed.

The same Docker-in-VM-over-NFS infrastructure accommodates all of these stacks with minimal reconfiguration. The container image changes; the workflow does not.

Why two models beat one

The effectiveness of the dual-agent review comes down to the fact that each model has different training data, different architectural choices, and different tendencies in how it approaches problems. Claude tends toward thoroughness and sometimes over-engineers solutions. Codex is often more direct and will flag when Claude has added complexity that the problem does not require. The inverse also happens: Codex can miss edge cases that Claude catches when positions are reversed.

The adversarial dynamic — one agent building, another actively looking for problems — mirrors what happens in a functioning human code review process. In concrete terms, the review loop regularly catches: missing error handling for uncommon but valid inputs, overly permissive configurations, incorrect assumptions about how existing code behaves, abstractions that are technically correct but harder to maintain than a simpler alternative, and security issues that are easy to overlook when you are focused on functionality.

This is not a claim that one model is superior to the other. The value is in the combination. Two agents with different blind spots, in an explicit review relationship, produce stronger output than either produces alone.

An obvious question is whether adding a third agent would improve things further. In practice, the returns diminish sharply. A third agent reviewing the same plan would either agree with one of the existing two — adding no new information — or introduce a third opinion that requires a tiebreaker, slowing convergence rather than helping it. Two agents already cover the constructive and adversarial roles across multiple review rounds. The one exception is post-implementation commit review, which operates on different context (actual code rather than a plan) and genuinely benefits from a fresh set of eyes. Beyond that, stacking more agents on the same task increases token costs and latency without a proportional gain in quality.

The developer stays in the loop

Delegating planning, implementation, and review to AI agents does not mean the developer disappears. I remain the final decision-maker on every output. The agents produce proposals; I decide what gets committed. That distinction matters.

In practice, I intervene at several points: when agents loop on the same disagreement without converging, when the plan has drifted from what was actually requested, when a decision requires domain knowledge or business context the agents do not have, and when the output simply does not look right even if I cannot immediately articulate why. That last one is worth taking seriously — developer instinct developed over years of working in a codebase is not something an agent has.

The role shifts from writing every line to directing, validating, and course-correcting. It is closer to being a tech lead who has very fast junior developers than to being replaced by automation. The efficiency comes from structured oversight, not from removing oversight.

Quality in, quality out

The quality of what an AI agent produces is a direct function of the quality of what you give it. Vague task descriptions produce vague results. An instruction like “improve the authentication flow” will produce something — but whether it is something useful depends entirely on what context you provide about the existing flow, what specifically needs to improve, and what constraints apply.

Good context includes: a precise description of what needs to be done, the relevant file paths, the patterns already in use in the codebase, any constraints (performance, backward compatibility, coding standards), and what a successful outcome looks like. Writing that context takes time, but it is time well spent. Agents that have clear context produce plans that require fewer revision cycles.
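That checklist can be made concrete as a small structure. The field names below are my own, not a format any agent requires — the point is only that a complete brief has these parts:

```python
from dataclasses import dataclass

# Illustrative task brief; the field names are my own invention,
# not a format any agent requires.
@dataclass
class TaskContext:
    goal: str               # precise description of what needs to be done
    files: list[str]        # relevant file paths
    patterns: str           # conventions the existing codebase follows
    constraints: list[str]  # performance, compatibility, coding standards
    done_when: str          # what a successful outcome looks like

    def render(self) -> str:
        """Flatten the brief into the prompt text handed to the agent."""
        lines = [
            f"Task: {self.goal}",
            "Files: " + ", ".join(self.files),
            f"Existing patterns: {self.patterns}",
        ]
        lines += [f"Constraint: {c}" for c in self.constraints]
        lines.append(f"Done when: {self.done_when}")
        return "\n".join(lines)
```

An empty `constraints` list or a vague `done_when` is usually where the extra revision cycles come from, which is exactly the point about under-specified tickets.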

Learning to write effective context is a skill that improves with practice. Early attempts tend to be under-specified, and you iterate more. Later attempts are tighter, and the plan-review loop converges faster. Structuring context for an AI agent is closely analogous to writing a good ticket for a human developer: if you cannot explain clearly what you want, the problem is not the tool.

When AI gets it wrong

AI agents are not infallible, and treating them as if they were is where projects go wrong. They hallucinate — confidently describing APIs or libraries that do not exist. They get stuck in loops, applying the same fix repeatedly when it does not work. They produce code that is plausible-looking but incorrect, and they will sometimes defend wrong approaches with apparent conviction.

Common failure patterns I have encountered: an agent repeatedly attempts a fix that does not address the root cause; an agent invents a function or method that does not exist in the library it is working with; an agent misunderstands the existing codebase despite being given relevant context, because the context was incomplete or misleading. Recognising these patterns early matters. When output stops making progress or starts going in circles, continuing to feed the loop is counterproductive.
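One cheap heuristic for spotting the repeated-fix pattern is to fingerprint each proposed change and flag a repeat. The sketch below is my own illustration, not part of any agent's tooling:

```python
import hashlib

def make_loop_detector(window: int = 5):
    """Return a checker that flags when a proposed patch repeats.

    A repeated fingerprint within the last `window` attempts is a strong
    sign the agent is reapplying the same fix. Illustrative heuristic only.
    """
    seen: list[str] = []

    def is_stuck(patch: str) -> bool:
        digest = hashlib.sha256(patch.encode()).hexdigest()
        stuck = digest in seen[-window:]   # same change proposed recently?
        seen.append(digest)
        return stuck

    return is_stuck
```

Exact-hash matching is deliberately crude — it misses a loop that varies whitespace between attempts — but even this level of bookkeeping catches the most common "same diff, fourth time" case before much is wasted on it.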

When this happens, the right response is to intervene: reset the context, rephrase the problem, break the task into a smaller chunk that is easier to reason about, or sometimes just do that part manually. If the output is simply bad, throw it away and start fresh. Sunk cost thinking does not apply to generated code — there is no craftsmanship invested in it that is worth preserving.

The planning-first workflow reduces these failures because most issues surface during the plan review phase, before any implementation has occurred. A bad plan is much cheaper to discard than bad code.

Running AI agents safely

AI agents that execute code, install packages, and write to the filesystem need to run in an isolated environment. This is not optional. The Docker-in-VM setup I use provides layered containment: unexpected behaviour from an agent affects the container, not the host. The VM boundary provides a second layer. Even if something goes badly wrong inside a container, the host machine and other projects are unaffected.

AI-generated code can introduce security vulnerabilities: SQL injection through improperly parameterised queries, insecure default configurations, overly permissive file or network access, missing input validation. The review loop helps catch these, but it does not eliminate them. Awareness and a security-conscious final review remain necessary.
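The SQL injection case is the easiest of these to demonstrate concretely. A minimal sketch using Python's built-in sqlite3 module, with an invented table and input:

```python
import sqlite3

# Invented table and input, just to show the mechanism.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

name = "alice' OR '1'='1"  # hostile input

# The vulnerable pattern an agent might emit interpolates the value
# into the SQL string, making the input part of the query:
#   conn.execute(f"SELECT * FROM users WHERE name = '{name}'")

# The parameterised form passes the value separately, so the hostile
# string is treated as data and matches no row:
rows = conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
```

The commented-out interpolated version is exactly the kind of plausible-looking line that slips through a functionality-focused review, which is why the security pass needs to be explicit.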

One practice worth being explicit about: do not feed sensitive credentials, API keys, or production data into AI agent contexts unless the environment is properly secured and you understand where that data goes. The convenience of giving an agent full context is real, but so is the risk of doing so carelessly.
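One way to make that practice mechanical is to strip anything secret-looking from the environment before it reaches an agent process. The name pattern below is a heuristic of my own and will not catch everything:

```python
import re

# Heuristic filter for building an agent's environment: drop any variable
# whose name looks secret-bearing. The pattern is my own and incomplete.
SECRET_NAME = re.compile(r"SECRET|TOKEN|KEY|PASSWORD|CREDENTIAL", re.IGNORECASE)

def safe_env(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of the environment without obviously sensitive entries."""
    return {k: v for k, v in env.items() if not SECRET_NAME.search(k)}
```

A deny-by-pattern filter is a floor rather than a ceiling; an explicit allow-list of known-safe variables is stricter and usually the better default for an agent sandbox.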

A practical tool, not a replacement

AI agents are a force multiplier for developers who apply structure and discipline to the process. The limited context window — often cited as a weakness — turns out to impose a discipline that is worth having regardless: plan carefully, work in well-scoped chunks, review before executing. These are good engineering practices with or without AI involvement.

The workflow described in this article is not theoretical. It has been applied across websites, Flutter mobile applications, C programs, Laravel backends, React and Vue frontends, study material preparation, and text writing. The process adapts to the stack; the stack does not dictate the process.

The developer who understands the problem, directs the process, and validates the output is still the one delivering the result. AI agents make that faster. They do not make it unnecessary.

Have questions about integrating AI into your development workflow? Feel free to get in touch.