Inference Notes

The cheapest line item in my AI app is the AI

Sathish J — Fri, 29 May 2026 15:40:16 GMT

When you ship a consumer app that leans on a large language model, the standard cost warning is always the same: watch your inference bill. Cap your usage. The model will eat your margin.

After running mine live for a while, I can report the opposite. The AI is the cheapest, most predictable line item I have — and the economics that actually matter turned out to be somewhere else entirely.

What the model actually costs

The app uses a small, cheap model for its AI features, and I built it to call that model as little as possible:

- A small model, not a frontier one, for tasks that don’t need a frontier one.
- Hard caps on output length and low temperature: predictable tokens in, predictable tokens out.
- Aggressive caching: a cache hit rate north of 80%, so most requests never reach the model at all.
- A free, deterministic data source tried first; the model is only the fallback.
- A per-user rate limit, so no single user can run up the bill.

Add it up and the cost to serve an active user lands on the order of a tenth of a percent of what they pay for a monthly subscription. On the P&L, the model is a rounding error.

The lesson isn’t “models are cheap now,” though they are. It’s that AI cost is an architecture decision, not a procurement one. You don’t get cheap inference by negotiating price-per-token; you get it by calling the model less — cache, cap, prefer deterministic sources, batch. The team that designs for fewer, smaller calls beats the team with a better rate card. That bullet list above is really a pricing decision disguised as architecture.
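Read as a request path, the list above looks roughly like the sketch below. This is a minimal illustration in Python, not the app's actual code: `call_model`, `lookup_deterministic`, the in-memory cache, and every limit are hypothetical stand-ins chosen for the example.

```python
import time
from collections import defaultdict, deque

MAX_OUTPUT_TOKENS = 200       # hypothetical hard cap on output length
TEMPERATURE = 0.2             # hypothetical low temperature
RATE_LIMIT = 20               # hypothetical: model calls per user per window
RATE_WINDOW_SECONDS = 3600

cache = {}                          # normalized query -> answer
recent_calls = defaultdict(deque)   # user_id -> timestamps of model calls
stats = {"model_calls": 0}          # counter kept only for illustration


def lookup_deterministic(query):
    """Stand-in for the free, deterministic source; returns None on a miss."""
    table = {"capital of france": "Paris"}
    return table.get(query)


def call_model(query, max_tokens, temperature):
    """Stand-in for the small model; a real client call would go here."""
    stats["model_calls"] += 1
    return f"model answer for: {query}"


def allow_model_call(user_id, now):
    """Sliding-window limit so no single user can run up the bill."""
    calls = recent_calls[user_id]
    while calls and now - calls[0] >= RATE_WINDOW_SECONDS:
        calls.popleft()
    if len(calls) >= RATE_LIMIT:
        return False
    calls.append(now)
    return True


def answer(user_id, query, now=None):
    now = time.time() if now is None else now
    # Normalize case and whitespace so near-duplicate queries share one entry.
    key = " ".join(query.lower().split())
    if key in cache:                              # 1. cache hit: no model call
        return cache[key]
    result = lookup_deterministic(key)            # 2. deterministic source first
    if result is None:
        if not allow_model_call(user_id, now):    # 3. per-user rate limit
            return "Rate limit reached, please try again later."
        result = call_model(key, MAX_OUTPUT_TOKENS, TEMPERATURE)  # 4. capped fallback
    cache[key] = result
    return result
```

The ordering is the point: the model sits last, behind every cheaper way of answering, so each earlier step that succeeds is a call that never gets billed.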

The line item that actually moves the margin

The expensive part of the app has nothing to do with AI. It’s distribution.

Put it on a single $10 transaction. Route it through iOS in-app purchase and the platform fee takes $1.50 to $3.00 of it; the inference behind that customer’s usage costs a fraction of a cent. The channel takes a couple hundred times what the model does, on the same dollar. So the app applies the long-established “reader” principle — apps from Netflix to Spotify to Kindle handle commerce on the web and use the iOS app for access. That’s the architecture I ended up with. (The formal App Store carve-out is narrow, and the rules are in flux post-Epic and under the EU’s DMA — so the point is the principle, not any one loophole.)
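The comparison on that $10 can be written out as a few lines of arithmetic. The price and the fee range come from the paragraph above; the inference figure of one cent is an assumption, picked to match "a tenth of a percent" of the price. The text puts the real figure lower, which only widens the gap.

```python
# All amounts in cents to keep the arithmetic exact.
PRICE = 1000       # the $10 transaction
FEE_LOW = 150      # $1.50, low end of the platform fee
FEE_HIGH = 300     # $3.00, high end of the platform fee
INFERENCE = 1      # assumption: one cent, about 0.1% of the price


def percent_of_price(amount):
    """Share of the transaction price, in percent."""
    return 100 * amount / PRICE


fee_share = (percent_of_price(FEE_LOW), percent_of_price(FEE_HIGH))
inference_share = percent_of_price(INFERENCE)
fee_vs_inference = (FEE_LOW // INFERENCE, FEE_HIGH // INFERENCE)

print(f"platform fee: {fee_share[0]:.0f}% to {fee_share[1]:.0f}% of price")
print(f"inference:    {inference_share:.1f}% of price")
print(f"fee is {fee_vs_inference[0]}x to {fee_vs_inference[1]}x the inference cost")
```

Under that assumption the fee is 150 to 300 times the inference cost, which is the "couple hundred times" in the text.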

A pricing person’s way to say it: your distribution channel’s take rate is usually a far bigger lever on margin than your cost of goods. I spend my day setting price inside a hardware P&L at scale, where a point of margin is real money and the structure (where value gets captured, not what it costs to make) dominates the financial model. The app just let me run the same logic at the opposite extreme: near-zero unit cost, and the structure still decides everything.

Two pricing reflexes that break when cost rounds to zero

When your unit cost rounds to zero, two reflexes that served older businesses well both break.

Cost-plus pricing makes your own product look cheap. Price a near-zero-cost feature on cost and you’ll price it near nothing, which tells the buyer it’s worth near nothing.

Per-usage metering makes the user anxious about a meter that’s measuring almost nothing — you collect all of the anxiety and almost none of the revenue.

Metering isn’t always wrong, though. It’s the right model when each call does expensive, legible work — a coding agent like Cursor or Claude Code, a per-resolution support agent — because there the meter tracks something the customer can see is worth paying for. It fails only when both the per-call cost and the per-call value are low, which is exactly the consumer-app case: each call is cheap and invisible, so the meter just generates worry. There, bundle pricing and value-anchored tiers win.

The same logic set my lifetime tier. I didn’t price it from cost; I priced it for what someone would pay for the certainty of never being asked to pay again — a willingness-to-pay number, not a cost number. A lifetime plan is normally a recurring-revenue anti-pattern: you trade a stream for a lump and a permanent obligation to serve. I took it deliberately, precisely because near-zero marginal cost makes that obligation nearly free to honor.
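A back-of-envelope version of that trade, with loudly hypothetical prices: the $10 monthly and $100 lifetime figures below are invented for illustration, and only the cost share of about a tenth of a percent comes from the earlier section.

```python
# Hypothetical prices in cents; only the 0.1% cost share comes from the text.
MONTHLY_PRICE = 1000          # assumed $10 per month
LIFETIME_PRICE = 10000        # assumed $100 one-time
COST_SHARE_PER_THOUSAND = 1   # cost to serve: about 0.1% of the monthly price

# Cost to serve one active user for one month, in cents.
monthly_cost = MONTHLY_PRICE * COST_SHARE_PER_THOUSAND // 1000

# How many months of serving the lump sum pays for at that cost.
months_funded = LIFETIME_PRICE // monthly_cost

# How many months of subscription revenue the lump sum replaces.
months_of_subscription = LIFETIME_PRICE // MONTHLY_PRICE

print(monthly_cost, months_funded, months_of_subscription)
```

With these made-up numbers the lump sum replaces ten months of subscription revenue but covers ten thousand months of serving cost. The stream-for-lump trade is real; the permanent obligation is not the expensive half of it.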

Then, at launch, I gave Pro away to the first cohort of families. That’s revenue I’m choosing not to collect — and it’s a standard early-stage sequencing move, not generosity. At this stage the binding constraint isn’t monetization; it’s getting enough engaged users in the door to learn what makes them stay. The risk you take on is anchoring — teach people the thing is free and some never convert — which is why it’s the first cohort, not forever.

Where the economics actually live

Once inference rounds to zero and you’ve protected your take rate, unit economics stop being about cost to serve at all. They’re dominated by activation and retention. If people sign up and ghost, no pricing tier saves you. If they stick, the model cost is noise. In a world of cheap inference, the highest-leverage “pricing” work often isn’t pricing — it’s getting someone to the “oh, I get it” moment fast enough that they come back.

The transferable part

If you price or build AI products, retire one reflex: treating token cost as the central economic question. It was the right worry in 2023; it’s increasingly the wrong one for products shaped like mine. The honest caveat is the frontier: multi-step agents and reasoning loops have pushed per-task cost up even as per-token prices fall, and cost still matters enormously there. But for the broad middle of consumer and SaaS AI features, cost-to-serve is trending toward noise.
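The frontier caveat is just multiplication: per-task cost is tokens per task times price per token, so it can rise while the price falls. The numbers below are hypothetical, chosen only to show the shape (price down 10x, tokens per task up 50x).

```python
def cost_per_task(tokens_per_task, micro_dollars_per_token):
    """Per-task cost in micro-dollars: tokens consumed times price per token."""
    return tokens_per_task * micro_dollars_per_token


# Hypothetical single-call feature at $10 per million tokens (10 micro-dollars each).
single_call = cost_per_task(tokens_per_task=2_000, micro_dollars_per_token=10)

# Hypothetical multi-step agent at $1 per million tokens (1 micro-dollar each).
agent_loop = cost_per_task(tokens_per_task=100_000, micro_dollars_per_token=1)

print(f"single call: ${single_call / 1_000_000:.2f} per task")
print(f"agent loop:  ${agent_loop / 1_000_000:.2f} per task")
print(f"per-task cost changed by {agent_loop / single_call:.0f}x")
```

In this illustration the token price drops tenfold and the per-task cost still rises fivefold, which is why cost stays a first-order question for agents while fading for small, capped calls.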

Watch where the supply side is pushing it. NVIDIA’s published comparison for its next-generation Vera Rubin platform puts inference cost per token at roughly a tenth of the current Blackwell generation’s at the same latency — on one reasoning workload, with the usual caveat that the number moves with model and operating point. The cost floor under inference keeps dropping. But the direction was never the interesting part. The interesting question is who keeps the margin when a token costs almost nothing — and it isn’t the resellers metering tokens, or the SaaS priced on cost. It accrues to whoever prices the system and the relationship. That’s the same lesson the app taught me, three orders of magnitude down.

Inference cost per token vs. latency — Rubin NVL72 vs. Blackwell NVL72, on one reasoning workload (Kimi K2-Thinking, 32K/8K). Source: NVIDIA.

The questions that decide whether an AI product makes money are the old, unglamorous ones: what take rate your channel imposes, how you package against willingness to pay, and whether you keep the user long enough for any of it to matter.

Price the value and the relationship — not the inference. The inference was never going to be the expensive part.

Three things that surprised me running a personal AI agent in production

Sathish J — Tue, 26 May 2026 16:43:49 GMT

For the last few months I’ve run a small agent system on a server in my house. It reads from about a dozen sources — three banks, a brokerage, property-management email, a medical-receipts ledger, my own app’s analytics — and assembles one message that lands on my phone twice a day. It drafts replies, files documents by year and category, and flags things I’d otherwise miss. It does not move money, sign anything, or email my tenants on its own.

It’s been live long enough now to be boring, which was the goal. Along the way three things surprised me. Each one cut against what I expected going in, and each turns out to matter well beyond a house.

1. The win was aggregation, not automation.

Going in, I assumed the value would be in the agent doing things — answering email, paying bills, scheduling. That’s the demo everyone builds.

What actually changed my days was duller than that: one message instead of twenty tabs. Before, a normal morning meant logging into App Store Connect, Google Ads, Firebase, three bank accounts, a brokerage, a property inbox, and a receipts ledger — each its own login, each a few minutes, each a context switch that cost more than the minutes. Now it’s one message at 6:30 a.m. and one at 8 p.m. The agent’s main job is to read everything, summarize it, and put it in a single place.

Most of the leverage in a personal agent — and, I’d argue, a business one — is in collapsing scattered state into one view. Automation is the dessert; aggregation is the meal. If you’re scoping an agent product, the unglamorous “read everything, summarize in one place” is probably worth more to the user than the impressive “take action” demo that gets the applause.
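In code, "read everything, summarize in one place" is a small loop. The sketch below assumes hypothetical reader functions that return canned data; real ones would call each service with a read-only token. None of the names come from the actual system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Item:
    source: str
    summary: str
    needs_attention: bool = False


# Hypothetical readers standing in for real, read-only integrations.
def read_bank() -> list[Item]:
    return [
        Item("bank", "balance unchanged since yesterday"),
        Item("bank", "unfamiliar charge of $42.10", needs_attention=True),
    ]


def read_property_inbox() -> list[Item]:
    return [Item("property", "tenant asked about a repair, reply drafted", needs_attention=True)]


def read_analytics() -> list[Item]:
    raise TimeoutError("login timed out")   # simulate one source being down


READERS = {"bank": read_bank, "property": read_property_inbox, "analytics": read_analytics}


def build_digest(readers: dict[str, Callable[[], list[Item]]]) -> str:
    """Collapse every source into one message, flagged items first."""
    items: list[Item] = []
    for name, read in readers.items():
        try:
            items.extend(read())
        except Exception as exc:
            # One failing source should show up in the digest, not kill it.
            items.append(Item(name, f"could not read source ({exc})", needs_attention=True))
    flagged = [i for i in items if i.needs_attention]
    routine = [i for i in items if not i.needs_attention]
    lines = [f"Needs attention ({len(flagged)}):"]
    lines += [f"  ! [{i.source}] {i.summary}" for i in flagged]
    lines.append(f"Routine ({len(routine)}):")
    lines += [f"  - [{i.source}] {i.summary}" for i in routine]
    return "\n".join(lines)


print(build_digest(READERS))
```

Nothing here takes an action. The whole value is in the merge and the ordering, with a source failure surfaced as one more flagged line.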

2. Constraining the agent made me trust it more — and use it more.

My instinct was that an agent earns its keep by acting, so I expected to keep widening what it was allowed to do on its own. The opposite happened.

The modules I rely on most are the ones that do the least. Property email gets a drafted reply that I approve with a tap; actually sending is gated behind a PIN. Banking is read-only — the agent can see balances and flag an odd charge, but it has no path to move a dollar. Investments are never traded. The medical module reads a sanitized ledger and never the underlying images.

Here’s the part I didn’t predict: the first time an agent does something you didn’t expect — even something small and harmless — you stop trusting it, and an untrusted tool gets quietly closed and never reopened. By keeping everything to draft → approve, I never had that moment. So I left more of it running, gave it more to read, leaned on it more. The constraint increased usage.

A narrow agent I trust beats a capable one I have to babysit. For anyone building agents for other people, the lesson is sharper still: “human in the loop” isn’t a compliance checkbox you bolt on at the end. It’s the feature that decides whether the thing survives contact with a real user. Design the approval step as the product, not as the guardrail around it.
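One way to make the approval step the product is to make it the only path to a send. The sketch below is an assumption about how such a gate could look, not the system's real code: the `Outbox` class, its method names, and the PIN handling are all hypothetical.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    SENT = "sent"


@dataclass
class Draft:
    recipient: str
    body: str
    state: State = State.DRAFT


class Outbox:
    """The agent may propose; only a human approval plus the PIN can send."""

    def __init__(self, pin: str):
        self._salt = os.urandom(16)
        self._pin_hash = self._hash(pin)
        self.sent: list[Draft] = []

    def _hash(self, pin: str) -> bytes:
        # Slow key derivation so the stored value is not the PIN itself.
        return hashlib.pbkdf2_hmac("sha256", pin.encode(), self._salt, 100_000)

    def propose(self, recipient: str, body: str) -> Draft:
        return Draft(recipient, body)          # the only call the agent needs

    def approve(self, draft: Draft) -> None:
        if draft.state is not State.DRAFT:
            raise ValueError("only a fresh draft can be approved")
        draft.state = State.APPROVED           # the human tap

    def send(self, draft: Draft, pin: str) -> None:
        if draft.state is not State.APPROVED:
            raise PermissionError("draft has not been approved")
        # Constant-time comparison of the derived keys.
        if not hmac.compare_digest(self._hash(pin), self._pin_hash):
            raise PermissionError("wrong PIN")
        self.sent.append(draft)                # a real transport would go here
        draft.state = State.SENT
```

Because the agent only ever holds `propose`, it cannot surprise anyone: the worst it can do is write a draft nobody approves.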

3. The model was the most replaceable part of the system.

Partway through, I switched the underlying model from one provider to another. Not because of quality — because the framework I was running expected a particular model’s tool-calling profile. The swap took an evening, and the system behaved the same the next morning.

What did not swap easily was the plumbing. Read-only, scoped API tokens instead of stored passwords. Running on an isolated machine that is never my work laptop, so a bad instruction can’t reach anything that matters. A redaction step that strips personal information before anything leaves the house — the medical module, for instance, never sends a receipt image downstream at all, only a sanitized ledger. A PIN gate on anything that can send. That layer is where the weeks actually went, and it’s the part I couldn’t rip out without rebuilding the whole thing.
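The redaction layer can be pictured as two small functions: a pattern scrub for free text and an allow-list for structured records. This is a deliberately crude sketch under stated assumptions. The two patterns and the field names are hypothetical, and a real redactor would need to cover far more (names, addresses, phone numbers).

```python
import re

# Hypothetical patterns: email addresses and long digit runs such as account numbers.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d{8,}\b")

# Fields a receipt may expose downstream; everything else stays in the house.
LEDGER_FIELDS = ("date", "category", "amount")


def redact(text: str) -> str:
    """Replace obviously personal tokens before text leaves the machine."""
    text = EMAIL.sub("[email]", text)
    return LONG_NUMBER.sub("[number]", text)


def sanitize_receipt(receipt: dict) -> dict:
    """Build a ledger row from an allow-list, so new fields are dropped by default."""
    return {key: receipt[key] for key in LEDGER_FIELDS if key in receipt}


receipt = {
    "date": "2026-05-12",
    "category": "pharmacy",
    "amount": 18.40,
    "patient_name": "Jane Doe",               # never forwarded
    "image_path": "/receipts/scan_0412.jpg",  # never forwarded
}

print(redact("Contact jane@example.com about account 1234567890"))
print(sanitize_receipt(receipt))
```

The allow-list is the design choice that matters: a block-list fails open when a new field appears, while an allow-list fails closed.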

The model is increasingly a commodity you can swap in an evening. The durable engineering is the boring security-and-data-handling layer around it. So when you’re evaluating an “AI feature” — your own or a vendor’s — spend less time on which model and more on three questions: what can it read, what can it write, and where does the data go? That’s the part that’s hard to change later, and it’s the part that decides whether you can trust the thing with anything real.
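Those three questions can be written down as data and checked in one place. The sketch below is illustrative only: the module names echo the ones described earlier, but the resource names and the `is_allowed` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModulePolicy:
    reads: frozenset    # what can it read?
    writes: frozenset   # what can it write?
    egress: frozenset   # where does the data go?


# Hypothetical policies mirroring the modules described above.
POLICIES = {
    "banking": ModulePolicy(
        reads=frozenset({"balances", "transactions"}),
        writes=frozenset(),                       # read-only: no path to move money
        egress=frozenset({"phone_digest"}),
    ),
    "property_email": ModulePolicy(
        reads=frozenset({"inbox"}),
        writes=frozenset({"drafts"}),             # drafts only; sending is gated elsewhere
        egress=frozenset({"phone_digest"}),
    ),
}


def is_allowed(module: str, kind: str, target: str) -> bool:
    """Default deny: unknown modules, kinds, and targets are all refused."""
    policy = POLICIES.get(module)
    if policy is None or kind not in ("reads", "writes", "egress"):
        return False
    return target in getattr(policy, kind)


print(is_allowed("banking", "reads", "balances"))       # permitted
print(is_allowed("banking", "writes", "transfers"))     # refused
```

A table like this is also what a vendor evaluation comes down to: if nobody can fill in the three columns, nobody knows what the feature can touch.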

The pattern underneath

None of this is exotic. A server that’s already on, some scoped tokens, a chat bot, a scheduler. The surprises weren’t technical — they were about where the value actually sat: in reading rather than acting, in constraint rather than capability, in the plumbing rather than the model.

Nor is any of it specific to a personal setup. The same three lessons decide things at company scale, anywhere an agent is put in front of real data and real consequences.