
The UPS Man, the Token Bill, and Why AI Just Hit Day 2

My wife used to buy books from Barnes & Noble online. This was the late nineties. The deal was something like spend twenty dollars, get twenty dollars off, new customers only. The catch was that a new customer was just a new email address, and email addresses were free. So we were all, technically, an infinite number of new customers.

It went further than books. There were sites that would literally mail you a dollar bill for signing up. Some sent dollar coins. The UPS man was at our house two or three times a day. Our son decided he wanted to be a UPS driver when he grew up, because the UPS driver showed up with packages and made people happy. From where he stood, that was the whole job. Bring good things to the door. Watch people light up.

Then the money ran out. Not literally, not at first. The companies realized the giveaway was not a business. More to the point, the investors realized it. They started asking the rude question. Where is the path to profit? Barron's ran a cover story in March 2000 called "Burning Up," walking through how fast these companies were torching cash, and predicting a lot of them would be gone within a year. A lot of them were. Pets.com lost $147 million in nine months and shut down nine months after its IPO. Webvan burned through more than $800 million before it folded. The mottos of the era were "get big fast" and "get large or get lost," and a remarkable number of those companies got large, then got lost.

The UPS man came around less and less. Eventually my son moved on to other career plans.

I think about that house a lot when I look at AI right now.

We gave it away

For the first year or two, everyone gave it away. Free accounts with access to compute that was costing the providers thousands of dollars. Tokens priced at pennies on the dollar, or no dollars at all. The whole industry was running the Barnes & Noble play at planetary scale, buying market share with someone else's balance sheet.

It got its own culture. The term that stuck was "tokenmaxxing," burning through as many tokens as humanly possible because heavy usage read as productivity. Amazon and Meta ran internal leaderboards ranking employees by token consumption, like it was a video game. In March, Nvidia's Jensen Huang said he would be "deeply alarmed" if a software engineer making half a million dollars a year wasn't spending a quarter million on tokens. The message from the top was clear. Use more. Use all of it. Figure out how to automate your own job while you're at it.

Then the bill arrived

The land grab is now ending, and it is ending the same way the last one did.

Uber spent its entire 2026 AI budget in the first four months of the year, and its COO said the internal cost was getting "harder to justify." The company capped AI coding tools at $1,500 per employee per month. Microsoft began pulling Claude Code licenses from many of its own developers and steering them toward a cheaper internal tool. Sam Altman, who runs OpenAI, called ballooning token costs "a huge issue" for customers. Free tiers are shrinking and rate limits are tightening across the board.

Here is the part that gets misread, so I want to be precise about it. The price of a single token has actually fallen. Hard. What has gone up is the bill, because agentic systems do not make one call, they make hundreds, sometimes thousands, to answer a single question. The researcher Gary Marcus put the multiplier for a single agent task as high as five hundred to a thousand times the tokens of a simple request. One trade analysis summed up the whole situation in five words: token prices fell, bills tripled. The free ride is not getting more expensive per unit. It is just no longer free, and the meter never stops.
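
If the arithmetic feels slippery, here is a back-of-the-envelope sketch. Every number in it is made up for illustration (the prices, the tokens per call, the call count), and `task_cost` is a hypothetical helper, not anyone's billing formula. The point is only the shape: the unit price drops by a factor of five and the cost of the task still goes up a hundredfold.

```python
def task_cost(price_per_million_tokens: float, tokens_per_call: int, calls: int) -> float:
    """Dollar cost of one task: total tokens times the per-token price."""
    total_tokens = tokens_per_call * calls
    return total_tokens * price_per_million_tokens / 1_000_000


# Hypothetical numbers, chosen only to show the shape of the problem.
# "Before": one plain call at a higher per-token price.
single_call = task_cost(price_per_million_tokens=10.0, tokens_per_call=2_000, calls=1)

# "After": the per-token price is five times lower, but an agent makes 500 calls.
agent_task = task_cost(price_per_million_tokens=2.0, tokens_per_call=2_000, calls=500)

print(f"one call at the old price:   ${single_call:.2f}")
print(f"agent task at the new price: ${agent_task:.2f}")
```

Swap in your own prices and call counts. The multiplier on calls almost always wins.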

What changes when the money gets real

You can already see the culture turning. Fewer breathless videos about getting one prompt to write an entire system. Fewer people insisting that the job now is to prompt your way to a finished product before lunch. More of a sober, almost boring question taking its place: where does this thing actually help, and what does it cost to put it there?

Take my biggest consulting project right now. The goal is to capture the knowledge of a room full of experts and put it to work alongside AI. A customer feeds in their data, the system analyzes it, and out come results, opinions, and recommended next steps. Sounds like a tidy LLM use case until you hit the requirement that breaks everything: the same input has to produce the same output, every single time, unless we deliberately change what the system knows.

Now ask yourself. When was the last time you handed an LLM the same data twice and got back the same answer? It does not happen. And this is not a temperature setting you forgot to turn down. Even at temperature zero, even with a fixed seed, the major providers tell you in their own documentation that the output is not guaranteed to be identical. Anthropic's own docs say results will not be fully deterministic even at zero. Thinking Machines Lab, the outfit Mira Murati started after leaving OpenAI, traced the main culprit in 2025 to something almost mundane: the math shifts depending on how your request gets batched with other people's on the server. They also showed it can be fixed. But fixed is not the default, and "we engineered batch-invariant kernels" is not a sentence you want load-bearing in a system a client is betting their decisions on.

So the determinism has to live somewhere other than the model. It lives in graphs, in classical machine learning, in rules and logic that run outside the AI brain. That logic becomes a guardrail, yes, but it is also doing real work. The knowledge gets processed outside the model and inside it, and the two have to agree.
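
To make that concrete, here is a minimal sketch of what a rule layer outside the model can look like. It is not the system from my project. The field names (`debt_ratio`, `months_of_runway`), the thresholds, and the helpers `assess`, `fingerprint`, and `agrees` are all invented for illustration. What it shows is the property that matters: the findings come from a pure function, so the same record produces the same result until someone deliberately bumps the rules version, and whatever the model says can be checked against it.

```python
import hashlib
import json

# Hypothetical version tag; bump it only when the knowledge deliberately changes.
RULES_VERSION = "v1"


def assess(record: dict) -> dict:
    """Pure function: no randomness, no network, no model call."""
    findings = []
    if record["debt_ratio"] > 0.5:        # illustrative threshold
        findings.append("debt_ratio_high")
    if record["months_of_runway"] < 6:    # illustrative threshold
        findings.append("runway_short")
    return {"rules_version": RULES_VERSION, "findings": sorted(findings)}


def fingerprint(result: dict) -> str:
    """Stable hash of a result, so two runs can be compared exactly."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def agrees(model_findings: list[str], result: dict) -> bool:
    """True only if the model's stated findings match the rule layer exactly."""
    return sorted(set(model_findings)) == result["findings"]


record = {"debt_ratio": 0.62, "months_of_runway": 4}
first = assess(record)
second = assess(record)

# Same input, same output, every time.
assert fingerprint(first) == fingerprint(second)

# A model answer that drops a finding gets caught instead of shipped.
print(agrees(["runway_short", "debt_ratio_high"], first))  # True
print(agrees(["runway_short"], first))                     # False
```

In a setup like this the model is free to phrase the explanation however it likes, but the conclusions are pinned by code you can diff and rerun.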

Which means we are back to building ecosystems. Not pointing an entire business at one LLM with a vector database bolted on and calling it an architecture.

RAG was the old way

RAG, in its original retrieve-then-generate form, is already the old way of doing this. It struggles exactly where my project lives, with numerical values, time, and expert rules, because similarity search does not reason. The newer vocabulary tells the story.

Cache-augmented generation (CAG) skips the live retrieval step entirely when the knowledge base is small and stable, preloading everything into the model's context. The paper that introduced it in late 2024 was titled, with some swagger, "Don't Do RAG." Knowledge-augmented generation (KAG), built by Ant Group's team, marries the model to an actual knowledge graph so it can handle the logic and relationships plain RAG fumbles. And the field has split RAG itself into passive and agentic. Passive RAG runs one fixed lookup and hopes for the best. Agentic RAG lets the system decide what to fetch, check whether it got enough, and go back for more. That power costs you, by the way, often three to ten times the tokens of a simpler pipeline, which loops us right back to the bill.
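
The passive versus agentic difference is easier to see in code than in prose. This is a toy sketch, not a real pipeline: `DOCS` is a three-entry stand-in for a knowledge base, `retrieve` is a keyword lookup standing in for vector search, and the follow-up queries and the "do I have enough?" check are hard-coded where a real agentic system would ask the model. All of those names are assumptions made up for the example.

```python
# Toy knowledge base; a real system would use a vector store.
DOCS = {
    "pricing": "Token prices fell.",
    "billing": "Bills rose because agents make many calls.",
    "limits": "Free tiers are shrinking.",
}


def retrieve(query: str) -> list[str]:
    """Keyword lookup standing in for similarity search."""
    words = set(query.lower().split())
    return [text for key, text in DOCS.items() if key in words]


def passive_rag(query: str) -> tuple[list[str], int]:
    """One fixed lookup; use whatever comes back."""
    return retrieve(query), 1


def agentic_rag(
    query: str, needed: int, follow_ups: list[str], max_rounds: int = 5
) -> tuple[list[str], int]:
    """Fetch, check whether there is enough, and go back for more if not.

    The follow-up queries and the sufficiency test (a simple passage count)
    are stubs for decisions a model would make in a real system.
    """
    found: list[str] = []
    calls = 0
    for q in ([query] + follow_ups)[:max_rounds]:
        calls += 1
        for passage in retrieve(q):
            if passage not in found:
                found.append(passage)
        if len(found) >= needed:
            break
    return found, calls


passive_passages, passive_calls = passive_rag("pricing")
agentic_passages, agentic_calls = agentic_rag(
    "pricing", needed=2, follow_ups=["billing", "limits"]
)

print(passive_calls, passive_passages)
print(agentic_calls, agentic_passages)
```

The agentic version comes back with more context, and it also makes more calls to get it. In a real pipeline each of those rounds carries model tokens too, which is where the extra cost comes from.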

Day 2

We are still early. But I think we finally crossed out of the land grab and into the part that actually counts.

Jeff Bezos used the phrase "Day 2" to mean something grim. Stasis, then decline, then death, which is why he insisted it was always Day 1 at Amazon. I'm borrowing the term and pointing it the other way, the way operators use it. Day 1 is the launch, the giveaway, the UPS man at the door three times a day. Day 2 is the morning after, when you have to run the thing, pay for the thing, and prove the thing was worth building. The free dollars stop arriving in the mail. The real work starts.

I liked Day 1. Everybody likes Day 1. But Day 2 is where you find out who was building a business and who was just enjoying the packages.

Frequently Asked Questions

Token costs are rising because modern AI systems don't make a single call to answer a question. Agentic pipelines make hundreds or thousands of calls per task, with some estimates putting the multiplier at 500 to 1,000 times the tokens of a simple request. So even as the price per token has dropped, the total bill has tripled for many users.

Passive RAG runs a single fixed lookup and uses whatever it gets back. Agentic RAG gives the system control over its own retrieval, letting it decide what to fetch, check whether the result is sufficient, and go back for more if it isn't. That extra capability comes at a real cost, typically three to ten times the tokens of a simpler pipeline.

Output determinism is a problem because LLMs don't reliably return the same answer when given the same input twice. Even at temperature zero with a fixed seed, major providers won't guarantee identical outputs, partly because results shift depending on how requests get batched with other users on the server. For any business system where consistent, repeatable results are required, that unpredictability can't be papered over with a settings tweak. The determinism has to be engineered outside the model entirely, through graphs, classical machine learning, and logic layers that run independently.

Cache-augmented generation (CAG) preloads a small, stable knowledge base directly into the model's context, skipping live retrieval altogether. Knowledge-augmented generation (KAG) connects the model to a knowledge graph so it can handle logic and relationships that plain RAG can't. Agentic RAG adds another layer by letting the system decide what to fetch, verify whether it got enough, and retrieve more if needed.

The dot-com boom and the current AI moment follow nearly the same script. Both started with massive giveaways funded by investor capital, where companies bought market share instead of building sustainable business models. Just as Pets.com and Webvan burned through hundreds of millions before collapsing when investors demanded a path to profit, AI providers have been subsidizing usage with free tiers and below-cost token pricing, and now the bills are arriving, with companies like Uber and Microsoft already pulling back and capping AI spending as costs become harder to justify.
