Three people, ten AI agents, one enterprise fintech app

Three Men in a Repo (To Say Nothing of the AI)

I kicked off a project with three people. A PM. A designer. An engineer. Same job title for all three: Product Builder.

We’re building a production enterprise fintech web app. Real users, real money, real compliance requirements. Not a hackathon demo. And we’re doing it between our day jobs, on calls during lunch, in 90-minute sprints after the kids fall asleep.

Every person on the team builds features end-to-end. The PM vibe-codes. The designer vibe-codes. The engineer vibe-codes. Everyone uses the same AI agent toolkit, from drafting requirements to writing tests to shipping implementation to code review.

The specialist roles still exist, but as guardrails, not bottlenecks. The PM owns the product vision and backlog. The designer owns the design system and UX standards. The engineer owns the tech stack and architecture. Then all three pick up features and build them start to finish.

Why this matters for PMs

I’ve spent 20 years as a PM feeling like a hostage in my own team. I want to A/B test a button color? Create a user story. Beg to squeeze it into a sprint. Write a justification. Wait. Watch it sit in the backlog while devs do spikes, tech talks, ticketing. Two weeks to change a button color, on paper. Then the padding breaks. Fix. Dev-QA back and forth. Release delayed. Three and a half months. For a button color.

I’m not blaming developers. That’s just how the system worked. Everything had to go through the pipeline, no matter how small.

A PM who can spin up a prototype, test an idea, validate with real users before writing a single ticket? I think that’s where the role is heading.

The guardrails that make this possible

People hear “AI coding” and picture cowboys pushing to main at 2 AM. No tests. No reviews. Ship and pray. We’re building the opposite.

Ten AI agents. Two pipelines:

Design system pipeline: task-creator, frontend dev, design QA, fixes.

Feature pipeline: PRD writer, PRD reviewer, tech lead, QA dev, QA reviewer, dev, dev reviewer, design QA, PR.

Each agent specialized. Each one adversarial to the one before it.
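
As a rough illustration (not the project’s actual code), the two pipelines can be written down as ordered lists of stages, with each stage paired against the one it checks. The stage names come from the lists above; the data layout and the handoffs helper are assumptions made for this sketch.

```python
# Stage names are taken from the post; the list layout is an assumption.
DESIGN_SYSTEM_PIPELINE = ["task-creator", "frontend dev", "design QA", "fixes"]

FEATURE_PIPELINE = [
    "PRD writer", "PRD reviewer", "tech lead", "QA dev", "QA reviewer",
    "dev", "dev reviewer", "design QA", "PR",
]


def handoffs(pipeline):
    """Return (producer, challenger) pairs: each stage reviews the one before it."""
    return list(zip(pipeline, pipeline[1:]))


for producer, challenger in handoffs(FEATURE_PIPELINE):
    print(f"{challenger} challenges {producer}")
```

Keeping the order in plain data makes it easy to see which stage is supposed to push back on which.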

Every agent reads the rulebook first: architecture, conventions, security, testing, git workflow. 1,800 lines of project law across 7 rule files. Not guidelines. Walls.

Three constraints that made the biggest difference:

Tests are locked. One AI agent writes the tests first (real TDD, not the “we’ll add tests later” version). A different AI agent tries to tear them apart with 15 adversarial attack vectors. False positives? Weak assertions? Missing edge cases? Dead on arrival. Only what survives gets locked. Then, and only then, does the coding agent get to work. If a test fails, the code is wrong. The spec is immutable.

The AI can’t freestyle the UI. 374 design tokens. 90 components organized by atomic design (atoms, molecules, organisms, templates). Small pieces compose deterministically. If something’s missing, the agent stops and flags it. It doesn’t improvise.

Zero tolerance. Every FAIL blocks the pipeline. No “minor issues,” no “fix later.” This was the single biggest quality improvement. A PR reviewer attacks every submission across 17 vectors: scope drift, security holes, race conditions, design system violations. Three rounds max. All green or no merge. No appeals.
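
The zero-tolerance rule is simple enough to sketch in a few lines. This is an illustration under assumptions, not the real reviewer: Finding, gate, and run_review are hypothetical names, and review_round stands in for whatever actually runs the review agent.

```python
from dataclasses import dataclass

MAX_ROUNDS = 3  # "Three rounds max", as stated in the post


@dataclass
class Finding:
    vector: str  # e.g. "scope drift" or "race conditions"
    status: str  # "PASS" or "FAIL"


def gate(findings):
    """All green or no merge. An empty review does not count as green."""
    return bool(findings) and all(f.status == "PASS" for f in findings)


def run_review(review_round, max_rounds=MAX_ROUNDS):
    """Call review_round(n) for up to max_rounds rounds.

    Returns True only if some round comes back fully green.
    """
    for n in range(1, max_rounds + 1):
        if gate(review_round(n)):
            return True
    return False
```

There is deliberately no severity field. Once a finding can be labeled “minor,” “fix later” has a way back in.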

The “delete and rebuild” test

We rebuilt our transaction history page three times in a few days. Not because it was broken. Because we wanted to test different UI components and see which one felt right.

Each rebuild took hours, not weeks. Zero fear. Zero bugs. Because 261 tests stayed locked (unit, integration, E2E, mock contracts, accessibility scans). Only the code changed.
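
One way to enforce “locked” mechanically is to hash the test files when they are approved and refuse any change afterwards. The sketch below assumes the tests live in a single directory; snapshot and changed_tests are hypothetical helpers, not part of the toolkit described here.

```python
import hashlib
from pathlib import Path


def snapshot(test_dir):
    """Map each test file's relative path to the SHA-256 of its contents."""
    root = Path(test_dir)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_tests(locked, test_dir):
    """Return sorted paths added, removed, or modified since the lock was taken."""
    current = snapshot(test_dir)
    return sorted(
        path
        for path in locked.keys() | current.keys()
        if locked.get(path) != current.get(path)
    )
```

A rebuild passes this check only when changed_tests returns an empty list, meaning the implementation changed and the spec did not.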

We even ran a full delete-and-rebuild exercise: we erased all the implementation code to see whether the AI could rebuild the feature from the tests alone.

When rewriting code costs almost nothing, “getting it wrong” stops being scary. You just burn through the bad ideas faster until you find the right one.

The PRD completeness problem

The hardest lesson so far wasn’t about code. It was about specs.

Our PRD reviewer approved a spec after 6 rounds: “READY. Zero FAILs. Zero WARNs. All checks pass.” A 110KB spec with 46 edge cases. It looked bulletproof. We built 3 tickets off it.

Then I checked the spec against the actual API and the existing app. Nine missing requirements, among them loyalty fields, payment method mapping, decimal formatting, and a field rename the reviewer never caught. Plus an entire reusable component that should have existed from the start.

The reviewer didn’t flag wrong things. It confidently blessed an incomplete thing.

That’s the real risk with AI in product development. Not hallucinating problems, but hallucinating completeness. The most dangerous output isn’t a wrong answer. It’s a confident “all checks pass” on an incomplete one.

AI doesn’t need a better reviewer. It needs a PM who knows what the reviewer doesn’t know to check.
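
Part of that cross-check can be made mechanical. A minimal sketch, assuming both the API response and the spec can be reduced to a flat list of field names; the field names below are invented for illustration and are not the real API.

```python
def missing_from_spec(api_fields, spec_fields):
    """Return API fields the spec never mentions, sorted for stable output."""
    return sorted(set(api_fields) - set(spec_fields))


# Hypothetical field names, for illustration only.
api = ["amount", "currency", "loyalty_points", "payment_method"]
spec = ["amount", "currency"]

print(missing_from_spec(api, spec))  # ['loyalty_points', 'payment_method']
```

A diff like this only catches fields that exist somewhere to be compared. It would not have caught the missing reusable component, which is the part that still needs a person.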

Where it stands

Built so far: 27,000 lines of code, 90 UI components, 4 features in flight. Three people, evenings and weekends.

What I’m still figuring out: how to make the pipeline less brittle when requirements change mid-feature, how to share context between agents without blowing up token budgets, and whether the “product builder” model works past three people. I have my doubts about that last one.

More on the agent pipeline itself in a future post.