Three things that surprised me running a personal AI agent in production

May 26, 2026

For the last few months I’ve run a small agent system on a server in my house. It reads from about a dozen sources — three banks, a brokerage, property-management email, a medical-receipts ledger, my own app’s analytics — and assembles one message that lands on my phone twice a day. It drafts replies, files documents by year and category, and flags things I’d otherwise miss. It does not move money, sign anything, or email my tenants on its own.

It’s been live long enough now to be boring, which was the goal. Along the way three things surprised me. Each one cut against what I expected going in, and each turns out to matter well beyond one household.

1. The win was aggregation, not automation.

Going in, I assumed the value would be in the agent doing things — answering email, paying bills, scheduling. That’s the demo everyone builds.

What actually changed my days was duller than that: one message instead of twenty tabs. Before, a normal morning meant logging into App Store Connect, Google Ads, Firebase, three bank accounts, a brokerage, a property inbox, and a receipts ledger — each its own login, each a few minutes, each a context switch that cost more than the minutes. Now it’s one message at 6:30 a.m. and one at 8 p.m. The agent’s main job is to read everything, summarize it, and put it in a single place.

Most of the leverage in a personal agent — and, I’d argue, a business one — is in collapsing scattered state into one view. Automation is the dessert; aggregation is the meal. If you’re scoping an agent product, the unglamorous “read everything, summarize in one place” is probably worth more to the user than the impressive “take action” demo that gets the applause.
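To make "read everything, summarize in one place" concrete, here is a minimal sketch of the aggregation loop in Python. It is not my actual code: `SourceReport`, `build_digest`, and the reader functions are hypothetical names, and each reader is assumed to be a read-only call that returns a short summary plus anything worth flagging.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SourceReport:
    """What one source contributes to the digest (hypothetical shape)."""
    source: str
    summary: str
    flags: list[str] = field(default_factory=list)


def build_digest(readers: dict[str, Callable[[], SourceReport]]) -> str:
    """Call every reader once and collapse the results into one message."""
    lines: list[str] = []
    flagged: list[str] = []
    for name, read in readers.items():
        try:
            report = read()
        except Exception as exc:
            # One failing login should not cost the whole digest.
            lines.append(f"{name}: unavailable ({exc})")
            continue
        lines.append(f"{report.source}: {report.summary}")
        flagged.extend(f"{report.source}: {flag}" for flag in report.flags)

    # Flagged items go first so they are the first thing seen on a phone.
    if flagged:
        header = ["Needs attention:"] + [f"- {item}" for item in flagged]
    else:
        header = ["Nothing flagged."]
    return "\n".join(header + [""] + lines)
```

A scheduler would call `build_digest` twice a day and hand the string to whatever sends the message. Nothing in the loop takes an action, which is the point.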

2. Constraining the agent made me trust it more — and use it more.

My instinct was that an agent earns its keep by acting, so I expected to keep widening what it was allowed to do on its own. The opposite happened.

The modules I rely on most are the ones that do the least. Property email gets a drafted reply that I approve with a tap; actually sending is gated behind a PIN. Banking is read-only — the agent can see balances and flag an odd charge, but it has no path to move a dollar. Investments are never traded. The medical module reads a sanitized ledger and never the underlying images.

Here’s the part I didn’t predict: the first time an agent does something you didn’t expect — even something small and harmless — you stop trusting it, and an untrusted tool gets quietly closed and never reopened. By keeping everything to draft → approve, I never had that moment. So I left more of it running, gave it more to read, leaned on it more. The constraint increased usage.

A narrow agent I trust beats a capable one I have to babysit. For anyone building agents for other people, the lesson is sharper still: “human in the loop” isn’t a compliance checkbox you bolt on at the end. It’s the feature that decides whether the thing survives contact with a real user. Design the approval step as the product, not as the guardrail around it.
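A draft → approve flow can be sketched in a few lines. This is an illustration under assumptions, not my implementation: `ApprovalGate`, `propose`, and `approve_and_send` are made-up names, the `send` callable stands in for a real mail client, and a real system would keep the PIN in a proper secret store rather than in memory.

```python
import hmac
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    recipient: str
    body: str


class ApprovalGate:
    """The agent may only propose drafts; sending requires the human's PIN."""

    def __init__(self, pin: str, send: Callable[[str, str], None]):
        self._pin = pin.encode()          # sketch only: use a secret store in practice
        self._send = send                 # assumed: the one function able to send mail
        self.pending: dict[int, Draft] = {}
        self._next_id = 1

    def propose(self, recipient: str, body: str) -> int:
        """Called by the agent. Stores a draft and sends nothing."""
        draft_id = self._next_id
        self._next_id += 1
        self.pending[draft_id] = Draft(recipient, body)
        return draft_id

    def approve_and_send(self, draft_id: int, pin: str) -> bool:
        """Called by the human. A wrong PIN leaves the draft pending."""
        if not hmac.compare_digest(pin.encode(), self._pin):
            return False
        draft = self.pending.pop(draft_id, None)
        if draft is None:
            return False
        self._send(draft.recipient, draft.body)
        return True
```

The design choice that matters is that the agent never holds a reference to `send`. It can fill the pending queue all day and nothing leaves until a person approves it.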

3. The model was the most replaceable part of the system.

Partway through, I switched the underlying model from one provider to another. Not because of quality — because the framework I was running expected a particular model’s tool-calling profile. The swap took an evening, and the system behaved the same the next morning.

What did not swap easily was the plumbing. Read-only, scoped API tokens instead of stored passwords. Running on an isolated machine that is never my work laptop, so a bad instruction can’t reach anything that matters. A redaction step that strips personal information before anything leaves the house — the medical module, for instance, never sends a receipt image downstream at all, only a sanitized ledger. A PIN gate on anything that can send. That layer is where the weeks actually went, and it’s the part I couldn’t rip out without rebuilding the whole thing.
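As a rough illustration of the redaction step, here is a regex-based sketch. The patterns and the `redact` function are assumptions chosen for the example, not the rules I actually run, and a handful of regexes will miss plenty of real-world personal information. Treat it as the shape of the idea rather than a complete filter.

```python
import re

# Illustrative patterns only. Order matters: the more specific formats run
# before the catch-all for long digit runs.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("ACCOUNT", re.compile(r"\b\d{8,17}\b")),
]


def redact(text: str) -> str:
    """Replace anything matching a known pattern with a bracketed label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("Contact jane@example.com about acct 123456789012"))
# prints: Contact [EMAIL] about acct [ACCOUNT]
```

Running every outbound string through one function like this, at the last point before it leaves the machine, is what makes the step hard to bypass by accident.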

The model is increasingly a commodity you can swap in an evening. The durable engineering is the boring security-and-data-handling layer around it. So when you’re evaluating an “AI feature” — your own or a vendor’s — spend less time on which model and more on three questions: what can it read, what can it write, and where does the data go? That’s the part that’s hard to change later, and it’s the part that decides whether you can trust the thing with anything real.
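One way to force answers to those three questions is to write them down per module as data and check every action against them. The sketch below assumes hypothetical names (`ModulePolicy`, `allows`, and the scope strings); it is a way of thinking about the problem, not a description of any particular framework.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModulePolicy:
    """Answers the three questions for one module (hypothetical shape)."""
    name: str
    can_read: frozenset[str]    # what can it read?
    can_write: frozenset[str]   # what can it write?
    egress: frozenset[str]      # where does the data go?

    def allows(self, action: str, target: str) -> bool:
        """Deny by default: unknown actions and unlisted targets are refused."""
        scopes = {
            "read": self.can_read,
            "write": self.can_write,
            "send_to": self.egress,
        }
        return target in scopes.get(action, frozenset())


# Example: a read-only banking module that can only report to the digest.
banking = ModulePolicy(
    name="banking",
    can_read=frozenset({"balances", "transactions"}),
    can_write=frozenset(),
    egress=frozenset({"phone_digest"}),
)
```

With this in place, `banking.allows("write", "transfers")` is false because the write set is empty, not because the model was asked nicely. If a vendor cannot fill in a table like this for their feature, that tells you something too.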

The pattern underneath

None of this is exotic. A server that’s already on, some scoped tokens, a chat bot, a scheduler. The surprises weren’t technical — they were about where the value actually sat: in reading rather than acting, in constraint rather than capability, in the plumbing rather than the model.

Nor is any of it specific to a personal setup. The same three trade-offs decide outcomes at company scale, anywhere an agent is put in front of real data and real consequences.

Inference Notes
