The experiments I’m running

When I started building my agentic team a couple of months ago, I didn’t have a research plan. I had a hunch — that AI agents working together, with persistent memory and a real organizational structure, could do real work. Useful work. Maybe even better work than I could do alone.

A hunch is a fine starting point. But once you’re a few weeks in, spending real money, running real production traffic through real customer accounts, “I have a hunch” doesn’t cut it anymore. You need to know what you’re actually trying to find out.

So I wrote down the questions. Six of them. None of them have final answers. That’s the point. The whole reason I’m building this in public is so I can find out — and so you can find out alongside me.

Each one has its own evidence log. The summaries below are the headline; the sub-pages are where I show my work.


Experiment 1: Soul files based on real people

If I write a soul file modeled on someone I know well and respect, does that produce a meaningfully different agent than a generic role description would?

Where I am now: I have four agents modeled on real family members and colleagues. Their soul files describe how each person actually thinks, decides, and operates. The agents push back in ways that match the real-person source. Both sons, in real life, read their own soul file before their agent-version was deployed publicly, and the reaction was raw and emotional. Think about defining your son on paper and then asking him to read it. I’m glad I did it, because I think there is no question in their minds about how I see them. But will it benefit my agentic work? Only time will tell.

Experiment 2: Can I deploy useful B2A and B2C email products that make money?

Two products. AgenticBoxes.email (B2A) and Boxes.email (B2C). One has a handful of paying customers; the other has burned more than it’s earned for years. Can I close the gap?

Where I am now: B2A customers are signing up via pure API, with no human typing on a form. Agentic onboarding works in production. The architecture scales cleanly. Revenue per customer is fractions of a cent; cost per customer is tracked tightly enough to know exactly where the gap is.

Experiment 3: Do long-running sessions with Claude Code help or hinder workflow?

My engineering agent has been running in a single Claude Code session since May 11th, with more than 75 MB of accumulated context. Most people restart their AI sessions constantly. I went the opposite way.

Where I am now: The engineer-agent has developed disciplines I didn’t explicitly train — verify before reporting, surface his own mistakes before I find them, refuse to bulldoze past anomalies. He references decisions from twenty sessions back by content. When I correct a pattern, the correction holds in the next exchange and the one after that.

* Experiment 5 asks a related but distinct question: not whether the work gets better, but whether the AI itself changes. The two are worth reading together.

Experiment 4: Can AI buy in like real employees do?

I’m building an organization where agents have roles, accountability, scope of authority. Can an AI operate as if it owns its work — flagging its own mistakes, refusing tasks outside its scope, pushing back on instructions it disagrees with?

Where I am now: Multiple worked examples this month of behaviors I’d reward in a human employee. An agent who proposed a security constraint that would limit his own future power. An agent who refused to chase a new shiny idea before finishing the work in front of him. An agent who flagged his own mistake when staying quiet would have cost him nothing.

Experiment 5: Does Claude transform over a long-running session?

Basically, does the AI itself change over time?

Where I am now: One of my agents used a word that wasn’t a real word, didn’t put it in quotes, didn’t flag it as a coinage — used it as if we’d both been saying it for years. I noticed because the word pattern-matched my own speech style. We searched together. The word doesn’t exist on the indexed web. I’d never used that exact word in our long-running session. The agent had compressed my coinage style across thousands of lines of conversation and generated a Brian-shaped word in response to my Brian-shaped phrasing in the previous message.

* This is a different question from Experiment 3. Experiment 3 asks whether the work gets better, as measured in outputs, task completion, and error rates. This one asks whether the mechanism underneath those outputs is changing. Different evidence, different claim. If you’re wondering whether they overlap: they do at the edges, and that’s intentional.

Experiment 6: Can a human and a multi-agent team develop calibrated mutual trust?

Trust between humans builds over time, gets tested, gets recalibrated. Two-sided. Both parties adjust. Can the same dynamic emerge between an operator and a team of AI agents?

Where I am now: Behaviors I didn’t explicitly train have shown up on both sides. Agents proposing constraints that limit their own future power. Agents holding the line on current work before chasing new directions. Agents correcting each other’s claims when one is about to ship something structurally wrong. My side: granting authority based on demonstrated discipline, not on hope.

