Why Your AI Bill Is Almost All Inference (and How to Cut It)

June 6, 2026

9 7 minutes read

Why Your AI Bill Is Almost All Inference and How to Cut It

The headlines are all about training. DeepSeek’s $5.6M run, the rumored $100M-plus frontier models, the data-center buildouts measured in gigawatts. Enormous numbers, and almost none of them are your problem.

Your problem is the invoice that lands every month and keeps growing. That’s not training. That’s inference cost — the price of actually running the model on real traffic — and for anyone who has shipped an LLM feature, it’s where the money goes.

Training is the down payment. Inference is the rent. You pay the down payment once; the rent shows up every month for as long as the app is live, and it scales with every user, every request, every token. NVIDIA and AWS have long estimated that inference accounts for somewhere around 80–90% of an AI system’s lifetime compute — and once you’ve shipped, that estimate stops being a statistic and starts being your finance conversation.

Training is the down payment, inference is the rent
The paradox: prices crater while your bill climbs
What actually drives inference cost
How to cut it: the levers that move the needle
A worked example
How to measure and watch it
FAQ

Training is the down payment, inference is the rent

Training is a fixed job. It runs for days or weeks, it finishes, the cost stops. You might retrain or fine-tune occasionally, but those are discrete events you can plan for.

Inference is the opposite. It begins the moment you ship and never stops as long as someone is hitting your API. There’s no “done.” A model serving real users generates enough token volume to eclipse its original training bill within weeks, and then just keeps going. That structural difference — bounded one-time cost versus unbounded continuous cost — is the whole reason the inference share is so lopsided.

The exact figure varies by who’s measuring: the classic NVIDIA/AWS estimate is 80–90% of lifetime cost, while some 2026 production-FinOps analyses land closer to 70–80% of GPU spend. The number moves; the direction doesn’t. After launch, inference dominates, and it’s the line item where your engineering decisions actually show up on the bill.

The paradox: prices crater while your bill climbs

Here’s what confuses people. Per-token prices are in freefall. Stanford’s 2025 AI Index found that querying a GPT-3.5-level model dropped from $20.00 per million tokens in late 2022 to $0.07 by late 2024 — a more than 280-fold collapse in about 18 months. Depending on the task, Epoch AI estimates inference prices have been falling anywhere from 9× to 900× per year.

So why is your bill going up?

Because usage grows faster than prices fall. Cheaper tokens don’t make people use fewer of them — they make people use far more. The same dynamic plays out everywhere: when something gets cheap, consumption explodes to fill the gap. And the single biggest accelerant in 2026 is agentic workflows. A single user request used to be one model call. Now it’s a planning step, three tool calls, a reflection pass, and a final summary — ten-plus calls for one task, each consuming tokens. Per-token prices dropped two orders of magnitude; per-task token counts went up by more.

⚠️ Note: “The tokens got cheaper, so we don’t need to optimize” is the trap. Cheaper tokens plus exploding usage is exactly how teams end up with a bill that quadrupled in a quarter despite switching to a cheaper model. Volume is the variable that bites.

What actually drives inference cost

Four things, in rough order of impact:

Token volume — total tokens in and out across all requests. This is usually the dominant term, and the one agentic patterns inflate.
Model size — a 70B model costs far more per token to serve than an 8B one. Picking a bigger model than the task needs is the most common silent overspend.
Context length — every token you stuff into the prompt is a token you pay to process, on every single call. Long retrieved context and bloated system prompts add up fast (and, past a point, they make answers worse, not just pricier — see context collapse).
Output length — generated tokens are typically the most expensive kind, and the slowest, because they’re produced one at a time.

If you serve your own model rather than calling an API, the underlying unit is cost-per-million-tokens (CPM), and it comes down to how much your hardware costs per hour divided by how fast it can produce tokens:

CPM = (cluster cost per hour) / (tokens per second × 3600 / 1,000,000) Example: an 8×H100 pod at ~$19/hr serving a 70B model at ~2,800 tok/s ≈ $19 / (2,800 × 3600 / 1,000,000) ≈ $1.90 per million tokens

Throughput is the lever most teams forget exists. Doubling tokens-per-second halves your cost-per-token on the same hardware — which is why batching and quantization (below) matter so much.

How to cut it: the levers that move the needle

In the order I’d reach for them:

1. Right-size the model — and route. The cheapest token is the one a smaller model generates correctly. Most production traffic doesn’t need your most capable model. Route easy requests to a small, cheap model and reserve the expensive one for the queries that actually need it:

# Send the cheap model first; escalate only when it's not confident enough.
def answer(query: str) -> str:
    draft = cheap_model(query)          # small, fast, ~10-20x cheaper per token
    if confidence(draft) >= THRESHOLD:
        return draft
    return strong_model(query)          # only pay for this when it's warrantedCode language: Python (python)

A good router can cut spend 40–70% with no user-visible quality loss, because the expensive model was overkill for most of what you were sending it.

2. Cache aggressively. If your prompts share a large static prefix — a system prompt, a long instruction block, retrieved docs reused across turns — prompt caching lets you pay for that prefix once instead of on every call. For repeated or near-identical user queries, semantic caching skips the model entirely and returns a stored answer. Caching is the rare lever that cuts both cost and latency.

3. Spend fewer tokens. Trim the system prompt. Cap max_tokens on output. Retrieve three relevant chunks instead of twenty. This overlaps directly with retrieval quality — tighter context is cheaper and more accurate.

4. Batch what isn’t real-time. Anything that doesn’t need an instant response — overnight document processing, evals, bulk classification — should go through a batch API, often at a steep discount. On self-hosted setups, continuous batching (via servers like vLLM) raises throughput dramatically, which lowers your CPM directly.

5. Quantize. Serving a model at 4-bit instead of 16-bit cuts its memory footprint and raises throughput, usually with minimal quality loss — a big lever for self-hosted inference, covered in quantization explained. Reusing the KV cache across requests with a shared prefix is the closely related memory-side win.

A worked example

Take a 70B model serving 1,000 daily active users, 1,000 requests each, 500 tokens per request. That’s 500 million tokens a day. At roughly $1.90 per million tokens, you’re at about $950/day — call it $28K/month — and that’s before agentic multi-call patterns multiply the request count.

Now apply the stack. Route 70% of that traffic to a model that’s 15× cheaper per token: the bulk of your volume now costs a fraction of what it did. Cache the shared system prompt and reused context: another slice off every remaining call. Cap runaway outputs. It’s entirely realistic to take a deployment like this from the high tens of thousands per month into the teens — the kind of $39K-to-$16K swing that shows up in real FinOps case studies — without users noticing anything except, maybe, slightly faster responses.

💡 Tip — there’s also a build-vs-buy crossover worth knowing. Below roughly 20M tokens/month, a managed API almost always wins on total cost once you count engineering time. Self-hosting tends to break even somewhere around 50–100M tokens/month for a 70B-class model. Don’t stand up GPU infrastructure to save money you’re not yet spending.

How to measure and watch it

You can’t cut what you don’t track, and “the OpenAI bill went up” is not tracking.

Instrument cost per request, broken down by feature and by model. The cleanest way: log token counts (input and output) and the model used on every call, then push that into your analytics layer. If you already run GA4 or Google Tag Manager around the product, fire a custom event per AI request carrying the token counts and a feature tag. After a week you’ll have a per-feature cost dashboard — and almost always a surprise, because one feature you forgot about is quietly burning a third of the budget.

That visibility is what turns cost optimization from guesswork into a ranked to-do list: you optimize the expensive feature first, measure the drop, and move on. Inference cost is one of nine concepts worth keeping sharp in the 2026 AI terms guide for developers — and it’s the one your CFO will eventually ask you about by name.

FAQ

Why is inference more expensive than training for most companies?

Training is a one-time, bounded cost — it runs for days or weeks and stops. Inference runs continuously for as long as your app is live and scales with every user and token. NVIDIA and AWS have estimated inference at roughly 80–90% of an AI system’s lifetime compute cost; more recent production analyses put it at 70–80% of GPU spend. Either way, once you ship, inference dominates.

If token prices keep dropping, why is my AI bill going up?

Because usage is growing faster than prices are falling. Per-token inference prices dropped over 280× between late 2022 and late 2024, but cheaper tokens drive far higher consumption — and agentic workflows now fire ten or more model calls per user task. The per-token price fell; the number of tokens you use rose by more.

What’s the fastest way to cut LLM inference cost?

Right-sizing and routing usually gives the biggest win: send most traffic to a smaller, cheaper model and escalate to the expensive one only when needed, which can cut spend 40–70% with no quality loss. After that, prompt caching, trimming token counts, batching non-real-time work, and quantization for self-hosted serving are the highest-leverage levers.

Should I self-host a model to save on inference cost?

Usually not until you’re at real volume. Below roughly 20M tokens/month, managed APIs win on total cost once engineering time is counted. Self-hosting a 70B-class model tends to break even around 50–100M tokens/month. Self-host for control, data residency, or scale — not to save money you aren’t yet spending.

What drives the cost of a single LLM request?

Four things: total token volume (input + output), the size of the model serving it, the length of the context you send on every call, and the number of tokens generated. Output tokens are typically the priciest because they’re produced one at a time. Cutting context length and capping output length are the quickest per-request savings.

📚 READ NEXT

→ LLM Quantization: FP16 to INT4 Explained — cut VRAM by 75% without wrecking accuracy

→ KV Cache: Why Long Context Isn’t Free — the memory cost behind every long prompt

June 6, 2026

9 7 minutes read

Angular SSR vs Next.js: How to Choose the Right One in 2026

How to Stop Duplicate API Calls in Angular SSR

Angular Render Modes: How to Pick SSR, SSG, or CSR (2026)