# Model Comparison

Which model should I use? Quick recommendations below, then price tiers, then the full comparison table.
## Which model for what?

| Use case | Pick | Why |
|---|---|---|
| Complex coding / large refactors | Claude Opus 4.6 | Best SWE-bench (72.5%); 1M context reads entire codebases. |
| Daily coding assistant (fast + cheap) | Claude Sonnet 4.6 | Best speed/quality ratio: $3/M in, still 93.8% HumanEval. |
| Batch processing / pipelines | GPT-4.1 nano | $0.10/M in; route 1M messages/day for about $4 total. |
| Self-hosted / private (best open source) | DeepSeek V3.2 | 94% HumanEval, MIT license; cheapest API if you don't self-host. |
| Self-hosted on a single GPU | Qwen 3 30B | 3B active params run fast on an RTX 4090; hybrid thinking mode. |
| Hard math / science / reasoning | o3 | 96.7% AIME; use reasoning_effort='high'. DeepSeek R1 is the open-source alternative. |
| Budget reasoning | Grok 3 mini | $0.30/M in; reasoning at a fraction of o3's $2/M. |
| Long documents (500K+ tokens) | Gemini 2.5 Pro / Llama 4 Scout | 1M context (Gemini) or 10M (Scout, open source). |
| Multimodal (images + video) | Gemma 4 | Open source, Apache 2.0, native image + video; #3 on LMSYS. |
| RAG / search embeddings | OpenAI / Nomic | OpenAI for quality; Nomic for self-hosted, zero-cost. |
| Code completion in editor | Qwen 2.5 Coder 32B | 92.7% HumanEval, Apache 2.0; self-hosted Copilot replacement. |
| On AWS (via Bedrock) | Claude / Llama | Both families are available on Bedrock; no external API keys needed. |
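The recommendations above amount to a small routing table. A minimal sketch, assuming you label incoming tasks yourself; the task labels and `pick_model` helper are illustrative, not a real API, and the picks are read off this page:

```python
# Task-label -> recommended model, mirroring the use-case list above.
# Nothing here calls a real API; the labels are hypothetical.
PICKS = {
    "large_refactor":   "Claude Opus 4.6",    # best SWE-bench, 1M context
    "daily_coding":     "Claude Sonnet 4.6",  # speed/quality balance
    "batch_pipeline":   "GPT-4.1 nano",       # $0.10/M input
    "self_hosted":      "DeepSeek V3.2",      # open weights, 94% HumanEval
    "single_gpu":       "Qwen 3 30B",         # 3B active params
    "hard_math":        "o3",                 # pair with reasoning_effort='high'
    "budget_reasoning": "Grok 3 mini",        # $0.30/M input
}

def pick_model(task: str) -> str:
    """Return the recommended model for a task label, or a safe default."""
    # Fall back to the daily-driver pick for unrecognized labels.
    return PICKS.get(task, "Claude Sonnet 4.6")
```

In practice you would branch further on privacy requirements (self-host vs API) and budget before the task label, but a flat dict keeps the mapping auditable.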
## Price tiers (input per 1M tokens)

**Free / self-host:** Llama 4 Maverick · Llama 4 Scout · Llama 3.3 70B · DeepSeek R1 · DeepSeek V3 · DeepSeek V3.2 · Qwen 3 235B · Qwen 3 30B · Qwen 2.5 Coder 32B · Mistral Large 3 · Ministral 3 (3B/8B/14B) · Codestral 25.01 · Gemma 4 31B Dense · Gemma 4 27B MoE · Gemma 4 E4B · Gemma 3 27B · Phi-4

**< $1/M:** Claude Haiku 4.5 · GPT-4.1 mini · GPT-4.1 nano · Gemini 2.5 Flash · Grok 3 mini · Kimi K2.5 · Moonshot v1 (8K/32K/128K)

**$1–5/M:** Claude Opus 4.7 · Claude Sonnet 4.6 · GPT-4.1 · o3 · o4-mini · Gemini 2.5 Pro · Grok 3

**> $5/M:** Claude Opus 4.6
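To turn per-token prices into a budget, multiply traffic by the input rate. A minimal sketch, using input prices from this page (output tokens, usually priced higher, are ignored for simplicity):

```python
# Input prices in USD per 1M tokens, copied from the tiers above.
INPUT_PRICE_PER_M = {
    "GPT-4.1 nano": 0.10,
    "Grok 3 mini": 0.30,
    "Claude Sonnet 4.6": 3.00,
    "Claude Opus 4.6": 15.00,
}

def input_cost(model: str, msgs_per_day: int, tokens_per_msg: int, days: int = 30) -> float:
    """Input-token cost in USD for `days` of traffic at the given volume."""
    total_tokens = msgs_per_day * tokens_per_msg * days
    return total_tokens / 1_000_000 * INPUT_PRICE_PER_M[model]

# 1M messages/day at ~40 input tokens each on GPT-4.1 nano:
# 40M tokens/day x $0.10/M = $4/day -> the "$4 total" batch figure above.
```

The same traffic on Claude Opus 4.6 would cost 150x more per input token, which is why tier choice dominates pipeline cost.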
## Full comparison

| Model | Vendor | License | Context | Input price | HumanEval | SWE-bench | Strengths |
|---|---|---|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | CLOSED | 200K tokens | $0.80/M | 88.5% | — | speed · cost · structured output |
| Claude Opus 4.6 | Anthropic | CLOSED | 1M tokens | $15/M | 95.4% | 72.5% | reasoning · long context · tool use |
| Claude Opus 4.7 | Anthropic | CLOSED | 1M tokens | $5/M | — | — | step-change over Opus 4.6 · agentic coding · new tokenizer · adaptive thinking |
| Claude Sonnet 4.6 | Anthropic | CLOSED | 200K tokens | $3/M | 93.8% | 65.2% | speed · cost-efficiency · coding |
| Codestral 25.01 | Mistral | OPEN | 256K tokens | self-host | 91.0% | — | code completion · FIM (fill-in-middle) · 80+ languages |
| DeepSeek R1 | DeepSeek | OPEN | 128K tokens | self-host | — | 49.2% | reasoning · math · coding |
| DeepSeek V3 | DeepSeek | OPEN | 128K tokens | self-host | 92.1% | — | coding · math · MoE 37B active / 671B total |
| DeepSeek V3.2 | DeepSeek | OPEN | 164K tokens | self-host | 94.0% | — | coding · math · sparse attention (DSA) |
| Gemini 2.5 Flash | Google | CLOSED | 1M tokens | $0.15/M | — | — | speed · cost · long context |
| Gemini 2.5 Pro | Google | CLOSED | 1M tokens | $1.25/M | — | 63.8% | multimodal · long context · search grounding |
| Gemma 3 27B | Google | OPEN | 128K tokens | self-host | 78.0% | — | compact · multimodal · runs on single GPU |
| Gemma 4 27B MoE | Google | OPEN | 128K tokens | self-host | — | — | MoE efficiency · multimodal · images + video |
| Gemma 4 31B Dense | Google | OPEN | 256K tokens | self-host | — | — | multimodal · images + video · 35+ languages |
| Gemma 4 E4B | Google | OPEN | 128K tokens | self-host | — | — | tiny · on-device · multimodal + audio |
| GPT-4.1 | OpenAI | CLOSED | 1M tokens | $2/M | 95.3% | 54.6% | coding · instruction following · long context |
| GPT-4.1 mini | OpenAI | CLOSED | 1M tokens | $0.40/M | 92.5% | 28.8% | cost · speed · long context |
| GPT-4.1 nano | OpenAI | CLOSED | 1M tokens | $0.10/M | — | — | ultra-cheap · fast · classification |
| Grok 3 | xAI | CLOSED | 128K tokens | $3/M | — | — | reasoning · real-time data · math |
| Grok 3 mini | xAI | CLOSED | 128K tokens | $0.30/M | — | — | fast reasoning · very cheap · math |
| Kimi K2.5 | Moonshot AI | CLOSED | 256K tokens | $0.55/M | — | — | reasoning · multimodal · cheap |
| Llama 3.3 70B | Meta | OPEN | 128K tokens | self-host | 88.4% | — | mature ecosystem · fine-tuning friendly · wide hardware support |
| Llama 4 Maverick | Meta | OPEN | 1M tokens | self-host | 84.8% | — | multilingual · multimodal · MoE architecture |
| Llama 4 Scout | Meta | OPEN | 10M tokens | self-host | — | — | longest context (10M) · MoE 17B active / 109B total · fits single H100 |
| Ministral 3 (3B/8B/14B) | Mistral | OPEN | 128K tokens | self-host | — | — | edge-friendly · vision · dense |
| Mistral Large 3 | Mistral | OPEN | 256K tokens | self-host | — | — | MoE 41B active / 675B total · multilingual · function calling |
| Moonshot v1 (8K/32K/128K) | Moonshot AI | CLOSED | 8K / 32K / 128K tokens | $0.14/M | — | — | very cheap · no hidden reasoning · reliable JSON |
| o3 | OpenAI | CLOSED | 200K tokens | $2/M | — | 69.1% | reasoning · math · science |
| o4-mini | OpenAI | CLOSED | 200K tokens | $1.10/M | — | 68.1% | reasoning · coding · cost-efficient reasoning |
| Phi-4 | Microsoft | OPEN | 16K tokens | self-host | — | — | 14B params · STEM reasoning · MIT license |
| Qwen 2.5 Coder 32B | Alibaba | OPEN | 128K tokens | self-host | 92.7% | — | code completion · code generation · Apache 2.0 |
| Qwen 3 235B | Alibaba | OPEN | 128K tokens | self-host | 90.2% | — | hybrid thinking · MoE 22B active · Apache 2.0 |
| Qwen 3 30B | Alibaba | OPEN | 128K tokens | self-host | — | — | MoE 3B active / 30B total · runs on consumer GPU · hybrid thinking |
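The context-window column is often the hard constraint for long-document work, so it is worth checking before price. A minimal sketch that pre-filters models by window size, using token counts copied from the table above (the `reserve` headroom for the reply is an arbitrary illustrative default):

```python
# Context windows in tokens, copied from the comparison table above
# (a representative subset, not the full table).
CONTEXT_WINDOW = {
    "Claude Sonnet 4.6": 200_000,
    "Claude Opus 4.6": 1_000_000,
    "Gemini 2.5 Pro": 1_000_000,
    "Llama 4 Scout": 10_000_000,
    "Phi-4": 16_000,
}

def models_that_fit(doc_tokens: int, reserve: int = 4_000) -> list[str]:
    """Models whose window holds the document plus `reserve` tokens of headroom."""
    return sorted(m for m, w in CONTEXT_WINDOW.items() if doc_tokens + reserve <= w)

# A 500K-token document rules out every 200K-and-under window,
# leaving only the 1M+ models from this subset.
```

This kind of filter composes naturally with a price sort: fit first, then pick the cheapest survivor.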