Pricing guide

LLM API cache and batch pricing: when headline token prices are not enough

Input and output token prices are still the baseline. But repeated context, delayed jobs, and optional processing tiers can shift the real cost comparison to other rows: cached input, cache write, batch, priority, or flex.

Updated June 18, 2026 · Cache and batch pricing · Works with calculator and compare

Decision path

Start with the workload timing

The model with the lowest headline token price is not always the cheapest choice in production. First decide whether the workload is interactive, repeated, offline, or tied to a processing tier.

Workload pattern | Pricing rows to inspect | Best next step
One-at-a-time product requests | Input and output | Estimate the monthly request shape in the calculator.
Repeated instructions, documents, or system context | Cached input and cache write | Start from the cache reuse preset, then change the hit rate.
Offline analysis or large jobs that can wait | Batch input and batch output | Use compare to check which models expose batch rows.
Latency or capacity tier decisions | Priority and flex input/output | Compare the optional processing rows beside standard prices.

Calculator

Use the cache reuse preset when repeated context is plausible

The calculator includes a cache reuse preset with 8k input tokens, 1k output tokens, 5k monthly requests, and a 50% cache hit rate. Treat that as a starting point, not a promise from any provider.

If a selected model exposes cached-input pricing, the calculator uses that row for the cached share. If it does not, the estimate falls back to ordinary input pricing so the scenario still has a conservative first-pass cost.
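
As a rough illustration of how the preset turns into a monthly figure, the sketch below applies the fallback rule described above. The per-million-token prices are made-up placeholders, not any provider's rates, and the function name `estimate_monthly_cost` is hypothetical. It also assumes the cached share is simply the hit rate times the input tokens, which the real calculator may handle differently.

```python
from typing import Optional

TOKENS_PER_MILLION = 1_000_000


def estimate_monthly_cost(
    input_tokens: int,
    output_tokens: int,
    monthly_requests: int,
    cache_hit_rate: float,
    input_price: float,
    output_price: float,
    cached_input_price: Optional[float] = None,
) -> float:
    """Estimate monthly cost in USD. All prices are USD per 1M tokens."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")

    # Fallback: with no cached-input row, bill the cached share as ordinary input.
    if cached_input_price is None:
        cached_input_price = input_price

    cached_tokens = input_tokens * cache_hit_rate
    fresh_tokens = input_tokens - cached_tokens

    per_request = (
        fresh_tokens * input_price
        + cached_tokens * cached_input_price
        + output_tokens * output_price
    )
    return per_request * monthly_requests / TOKENS_PER_MILLION


# The cache reuse preset from the text.
PRESET = dict(
    input_tokens=8_000,
    output_tokens=1_000,
    monthly_requests=5_000,
    cache_hit_rate=0.5,
)

# Placeholder prices (assumptions for illustration only).
with_cache_row = estimate_monthly_cost(
    **PRESET, input_price=2.00, output_price=8.00, cached_input_price=0.50
)
without_cache_row = estimate_monthly_cost(
    **PRESET, input_price=2.00, output_price=8.00
)

print(f"{with_cache_row:.2f}")     # 90.00
print(f"{without_cache_row:.2f}")  # 120.00
```

With these placeholder numbers, the gap between the two results is the saving that depends entirely on the hit rate holding up in production.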

Cache check

  • Does the repeated text meet the provider's caching requirements?
  • Is cache write pricing separate from cache hit pricing?
  • Does the cache expire before the repeated requests arrive?
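
The second and third questions interact, and a small sketch can show why. It assumes a provider that bills a cache write at one rate and later cache hits at a lower rate. All prices are placeholders, and the function names and the `writes` parameter are hypothetical. `writes` counts how often the prefix has to be written again, for example because the cache expired between requests.

```python
TOKENS_PER_MILLION = 1_000_000


def prefix_cost_without_cache(
    prefix_tokens: int, requests: int, input_price: float
) -> float:
    """Cost in USD of resending the same prefix at the ordinary input price."""
    return prefix_tokens * requests * input_price / TOKENS_PER_MILLION


def prefix_cost_with_cache(
    prefix_tokens: int,
    requests: int,
    cache_write_price: float,
    cache_hit_price: float,
    writes: int = 1,
) -> float:
    """Cost in USD when `writes` requests pay the write price and the rest hit."""
    if not 1 <= writes <= requests:
        raise ValueError("writes must be between 1 and requests")
    hits = requests - writes
    billed = writes * cache_write_price + hits * cache_hit_price
    return prefix_tokens * billed / TOKENS_PER_MILLION


# Placeholder prices per 1M tokens (assumptions for illustration only).
INPUT, WRITE, HIT = 2.00, 2.50, 0.20

baseline = prefix_cost_without_cache(6_000, 10, INPUT)
one_write = prefix_cost_with_cache(6_000, 10, WRITE, HIT, writes=1)
always_expired = prefix_cost_with_cache(6_000, 10, WRITE, HIT, writes=10)

print(f"{baseline:.4f}")        # 0.1200
print(f"{one_write:.4f}")       # 0.0258
print(f"{always_expired:.4f}")  # 0.1500
```

Under these assumptions, a cache that expires before every reuse costs more than not caching at all, which is why the expiry question belongs next to the pricing question.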

Compare

Use compare for rows the calculator does not model yet

The compare page shows more pricing dimensions than a single monthly calculator scenario can express. Use it to inspect cached input, cache write, batch input, batch output, priority input/output, and flex input/output beside ordinary token prices.

This is especially useful when two models look similar on standard input/output cost, but one route exposes a cheaper asynchronous or cache-aware path for the workload.
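
To make that case concrete, here is a minimal sketch with two invented models. The model names, the prices, and the `PriceRows` and `cheapest_route` helpers are all hypothetical. A batch row is only considered when the workload can wait and the model exposes both batch prices.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class PriceRows:
    """Pricing rows for one model, in USD per 1M tokens. None means no such row."""
    name: str
    input: float
    output: float
    batch_input: Optional[float] = None
    batch_output: Optional[float] = None


def route_costs(
    model: PriceRows, input_tokens: int, output_tokens: int, can_wait: bool
) -> Dict[str, float]:
    """Return the cost of each route this model offers for the workload."""
    costs = {
        "standard": (input_tokens * model.input + output_tokens * model.output)
        / TOKENS_PER_MILLION
    }
    has_batch = model.batch_input is not None and model.batch_output is not None
    if can_wait and has_batch:
        costs["batch"] = (
            input_tokens * model.batch_input + output_tokens * model.batch_output
        ) / TOKENS_PER_MILLION
    return costs


def cheapest_route(
    models: List[PriceRows], input_tokens: int, output_tokens: int, can_wait: bool
) -> Tuple[str, str, float]:
    """Return (model name, route, cost) for the cheapest available route."""
    candidates = [
        (cost, model.name, route)
        for model in models
        for route, cost in route_costs(
            model, input_tokens, output_tokens, can_wait
        ).items()
    ]
    cost, name, route = min(candidates)
    return name, route, cost


# Invented models with similar standard prices; only model-b has batch rows.
MODELS = [
    PriceRows("model-a", input=1.00, output=4.00),
    PriceRows("model-b", input=1.10, output=4.40, batch_input=0.55, batch_output=2.20),
]

interactive = cheapest_route(MODELS, 10_000_000, 2_000_000, can_wait=False)
offline = cheapest_route(MODELS, 10_000_000, 2_000_000, can_wait=True)
print(interactive[:2])  # ('model-a', 'standard')
print(offline[:2])      # ('model-b', 'batch')
```

The model that looks slightly more expensive on standard rows becomes the cheaper route once the job can wait, which is the situation the compare page is meant to surface.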

Do not skip

Batch and cache rows do not replace provider documentation. Confirm eligibility, rate limits, retention, latency, and billing behavior before treating a route as production-ready.

Provider docs

Check the provider rule before trusting the estimate

OpenAI

Check the pricing page for cached-input, Batch API, priority, and flex rows, then use the prompt caching guide to confirm eligibility and cached-token reporting.
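
Cached-token reporting can also feed back into the hit rate used for estimates. The sketch below assumes usage payloads shaped like plain dictionaries with a `prompt_tokens` count and a nested `prompt_tokens_details.cached_tokens` count. Treat those field names as an assumption to verify against the prompt caching guide; the helper `observed_cache_hit_rate` is hypothetical.

```python
from typing import Any, Iterable, Mapping


def observed_cache_hit_rate(usages: Iterable[Mapping[str, Any]]) -> float:
    """Return cached prompt tokens divided by all prompt tokens, or 0.0 if none."""
    total_prompt = 0
    total_cached = 0
    for usage in usages:
        total_prompt += usage.get("prompt_tokens", 0)
        # Assumed field layout; a missing details object counts as zero cached tokens.
        details = usage.get("prompt_tokens_details") or {}
        total_cached += details.get("cached_tokens", 0)
    return total_cached / total_prompt if total_prompt else 0.0


# Invented usage records for illustration.
sample_usages = [
    {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1024}},
    {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 0}},
    {"prompt_tokens": 1000},
]

hit_rate = observed_cache_hit_rate(sample_usages)
print(f"{hit_rate:.2%}")  # 20.48%
```

A measured rate like this is a better input for the calculator than the 50% starting point in the preset.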

Anthropic

Check cache write, cache hit, batch, and long-context notes before treating a Claude workload as a simple input/output estimate.

Gemini

Check context caching token prices, storage prices, and cache TTL behavior before estimating repeated-context workloads.

Limits

What this guide does not decide

This guide does not rank models on quality or latency, and it does not account for quotas, regional availability, tool charges, account discounts, or final invoices. It also does not guarantee cache eligibility for any prompt. Use it to choose the right pricing rows, then verify the production contract with the provider.