Pricing guide

LLM API cache and batch pricing: when headline token prices are not enough

Input and output token prices are still the baseline. But repeated context, delayed jobs, and optional processing tiers can shift the cost that actually matters to the cached input, cache write, batch, priority, or flex rows.

Updated June 18, 2026 | Cache and batch pricing | Works with calculator and compare

Decision path

Start with the workload timing

The model with the lowest headline token price is not always the cheapest in production. First decide whether the workload is interactive, repeated, offline, or tied to a processing tier.

Workload pattern	Pricing rows to inspect	Best next step
One-at-a-time product requests	Input and output	Estimate the monthly request shape in the calculator.
Repeated instructions, documents, or system context	Cached input and cache write	Start from the cache reuse preset, then change the hit rate.
Offline analysis or large jobs that can wait	Batch input and batch output	Use compare to check which models expose batch rows.
Latency or capacity tier decisions	Priority and flex input/output	Compare the optional processing rows beside standard prices.

Calculator

Use the cache reuse preset when repeated context is plausible

The calculator includes a cache reuse preset with 8k input tokens, 1k output tokens, 5k monthly requests, and a 50% cache hit rate. Treat that as a starting point, not a promise from any provider.

If a selected model exposes cached-input pricing, the calculator uses that row for the cached share. If it does not, the estimate falls back to ordinary input pricing so the scenario still has a conservative first-pass cost.
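The preset arithmetic and the fallback rule can be sketched as follows. This is a rough illustration, not the calculator's actual code: the function name, the per-million-token price units, and the reading of "cache hit rate" as the share of input tokens billed at the cached rate are all assumptions, and the prices are placeholders rather than any provider's rates.

```python
from typing import Optional


def monthly_cost(
    input_tokens: int,
    output_tokens: int,
    requests: int,
    cache_hit_rate: float,
    input_price: float,
    output_price: float,
    cached_input_price: Optional[float] = None,
) -> float:
    """Estimate monthly cost in USD. Prices are USD per 1M tokens (assumed unit)."""
    # Fallback: with no cached-input row, bill the cached share at the
    # ordinary input price, which gives a conservative estimate.
    if cached_input_price is None:
        cached_input_price = input_price

    total_input = input_tokens * requests
    cached = total_input * cache_hit_rate  # assumed: hit rate is a token share
    uncached = total_input - cached
    total_output = output_tokens * requests

    return (
        uncached * input_price
        + cached * cached_input_price
        + total_output * output_price
    ) / 1_000_000


# Cache reuse preset: 8k input, 1k output, 5k requests, 50% hit rate.
# Placeholder prices only.
with_cache_row = monthly_cost(8_000, 1_000, 5_000, 0.5, 2.00, 8.00, 0.50)
without_cache_row = monthly_cost(8_000, 1_000, 5_000, 0.5, 2.00, 8.00)
print(with_cache_row)     # 90.0
print(without_cache_row)  # 120.0
```

With these placeholder prices, the cached-input row lowers the estimate from 120 to 90 USD per month. Changing the hit rate moves only the input portion of the cost.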

Cache check

Does the repeated text meet the provider's caching requirements?
Is cache write pricing separate from cache hit pricing?
Does the cache expire before the repeated requests arrive?
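The second and third checks above can be turned into a rough break-even test. The sketch below assumes a billing model in which the first request pays the cache write price instead of the ordinary input price for the repeated prefix, and each later request that arrives before expiry pays the cache hit price. That model, the function name, and the placeholder prices are assumptions; providers differ, and some also charge for cache storage, which this sketch ignores.

```python
def cache_savings(
    prefix_tokens: int,
    uses_before_expiry: int,
    input_price: float,
    cache_write_price: float,
    cache_hit_price: float,
) -> float:
    """Return USD saved by caching a repeated prefix (negative means a loss).

    Prices are USD per 1M tokens (assumed unit). Only requests that arrive
    before the cache expires count toward uses_before_expiry.
    """
    if uses_before_expiry < 1:
        return 0.0

    # Without caching, every use pays the ordinary input price.
    without_cache = prefix_tokens * uses_before_expiry * input_price

    # With caching (assumed model): one write, then hits for the remaining uses.
    with_cache = prefix_tokens * (
        cache_write_price + (uses_before_expiry - 1) * cache_hit_price
    )
    return (without_cache - with_cache) / 1_000_000


# Placeholder prices: write costs more than ordinary input, hits cost less.
for uses in (1, 2, 10):
    print(uses, cache_savings(6_000, uses, 2.00, 2.50, 0.20))
```

Under these placeholder prices, a prefix that is used only once before the cache expires costs more with caching than without it, while a second use within the window already pays back the write premium.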

Compare

Use compare for rows the calculator does not model yet

The compare page shows more pricing dimensions than a single monthly calculator scenario can express. Use it to inspect cached input, cache write, batch input, batch output, priority input/output, and flex input/output beside ordinary token prices.

This is especially useful when two models look similar on standard input/output cost, but one route exposes a cheaper asynchronous or cache-aware path for the workload.
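A small sketch of that situation: two hypothetical models with similar standard prices, where only one exposes batch rows. The class, field names, and all prices are illustrative assumptions and do not reflect the compare page's data model or any provider's rates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PricingRows:
    """Per-model pricing rows in USD per 1M tokens (assumed unit)."""
    name: str
    input: float
    output: float
    batch_input: Optional[float] = None
    batch_output: Optional[float] = None


def job_cost(rows: PricingRows, input_tokens: int, output_tokens: int,
             can_wait: bool) -> float:
    """Cost of a job, using batch rows only if they exist and the job can wait."""
    use_batch = (
        can_wait
        and rows.batch_input is not None
        and rows.batch_output is not None
    )
    in_price = rows.batch_input if use_batch else rows.input
    out_price = rows.batch_output if use_batch else rows.output
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical models with placeholder prices.
model_a = PricingRows("model-a", input=2.00, output=8.00,
                      batch_input=1.00, batch_output=4.00)
model_b = PricingRows("model-b", input=1.90, output=7.80)  # no batch rows

for can_wait in (False, True):
    for m in (model_a, model_b):
        cost = job_cost(m, 100_000_000, 10_000_000, can_wait)
        print(f"can_wait={can_wait} {m.name}: {cost:.2f} USD")
```

In this made-up case, model-b is slightly cheaper for interactive traffic, but model-a becomes much cheaper once the job can wait for batch processing.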

Do not skip

Batch and cache rows do not replace provider documentation. Confirm eligibility, rate limits, retention, latency, and billing behavior before treating a route as production-ready.

Provider docs

Check the provider rule before trusting the estimate

OpenAI

Check the pricing page for cached-input, Batch API, priority, and flex rows, then use the prompt caching guide to confirm eligibility and cached-token reporting.

Pricing | Prompt caching

Anthropic

Check cache write, cache hit, batch, and long-context notes before treating a Claude workload as a simple input/output estimate.

Pricing | Batch processing

Gemini

Check context caching token prices, storage prices, and cache TTL behavior before estimating repeated-context workloads.

Pricing | Caching

Limits

What this guide does not decide

This guide does not rank models on quality or latency, and it does not account for quotas, regional availability, tool charges, account discounts, or final invoices. It also does not guarantee cache eligibility for any prompt. Use it to choose the right pricing rows, then verify the production contract with the provider.
