RAG
Compare cost for retrieval-heavy answers where the model reads user text plus extra document context.
What matters most
The context window must be large enough to hold the prompt and the retrieved text. Because retrieved text usually makes the input much longer than the answer, input price often matters more than output price.
Base example
The base example assumes 100 question tokens, 2,000 retrieved tokens, and 500 answer tokens per question, at 10,000 questions per month.
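
The arithmetic behind a monthly cost figure can be sketched as follows. This is a minimal illustration, not the page's actual calculator: the function names are made up for this example, and the output price of $0.03 per 1M tokens is a hypothetical value, since only input prices are listed on this page.

```python
def monthly_tokens(question_tokens=100, retrieved_tokens=2_000,
                   answer_tokens=500, questions_per_month=10_000):
    """Return (input_tokens, output_tokens) per month for the workload."""
    # Input is everything the model reads: the question plus retrieved text.
    input_tokens = (question_tokens + retrieved_tokens) * questions_per_month
    # Output is the generated answer.
    output_tokens = answer_tokens * questions_per_month
    return input_tokens, output_tokens


def monthly_cost(input_price_per_m, output_price_per_m, **workload):
    """Monthly cost in dollars, given prices per 1M tokens."""
    input_tokens, output_tokens = monthly_tokens(**workload)
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


input_tokens, output_tokens = monthly_tokens()
print(input_tokens, output_tokens)  # 21000000 5000000

# Input price as in the first table row; output price is an assumed value.
cost = monthly_cost(input_price_per_m=0.01, output_price_per_m=0.03)
print(f"${cost:.2f}")  # $0.36
```

With this workload, input accounts for 21M of the 26M tokens processed each month, which is why a difference in input price moves the total more than the same difference in output price.
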
Low-cost RAG models in the base example
Only models whose context window is large enough for the base example are shown here.
| Model | Context window (tokens) | Input price | Monthly cost |
|---|---|---|---|
| Qwen2.5-Coder-7B | 32.8K | $0.0100 / 1M tokens | $0.36 |
| llama3.2-11b-vision-instruct | 131.1K | $0.0150 / 1M tokens | $0.44 |
| llama3.2-3b-instruct | 131.1K | $0.0150 / 1M tokens | $0.44 |
| Llama-3.2-3B-Instruct | 131.1K | $0.0200 / 1M tokens | $0.52 |
| paddleocr-vl | 16.4K | $0.0200 / 1M tokens | $0.52 |
| Meta-Llama-3.1-8B-Instruct-Turbo | 131.1K | $0.0200 / 1M tokens | $0.57 |
| Mistral-Nemo-Instruct-2407 | 131.1K | $0.0200 / 1M tokens | $0.62 |
| llama-3.1-8b-instruct | 16.4K | $0.0200 / 1M tokens | $0.67 |
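
The context-size filter applied to this table can be sketched as a simple comparison. This sketch makes several assumptions: the answer tokens are counted against the context window (providers differ on this), the rounded sizes 32.8K and 16.4K stand for 32,768 and 16,384 tokens, and `hypothetical-small-model` is an invented entry that only shows what gets excluded.

```python
# Tokens one request needs in the base example: question + retrieved + answer.
BASE_EXAMPLE_TOKENS = 100 + 2_000 + 500


def fits_context(context_window_tokens, required_tokens=BASE_EXAMPLE_TOKENS):
    """True if a single request fits inside the model's context window."""
    return context_window_tokens >= required_tokens


# Context sizes are assumed exact values behind the rounded table figures.
candidates = {
    "Qwen2.5-Coder-7B": 32_768,
    "paddleocr-vl": 16_384,
    "hypothetical-small-model": 2_048,  # invented, too small for the example
}

shown = [name for name, window in candidates.items() if fits_context(window)]
print(shown)  # ['Qwen2.5-Coder-7B', 'paddleocr-vl']
```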