Model Comparison

Which model should I use? Quick answers below, followed by a full side-by-side table.

Which model for what?

Complex coding / large refactors

Claude Opus 4.6 or o3

Best SWE-bench score (72.5%); the 1M-token context can hold an entire codebase.

Daily coding assistant (fast + cheap)

Claude Sonnet 4.6 or GPT-4.1

Best speed-to-quality ratio: $3/M input, yet still 93.8% on HumanEval.

Batch processing / pipelines

GPT-4.1 nano or Gemini 2.5 Flash

$0.10/M input. Routes 1M short messages/day for about $4 total.
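The $4/day figure works out if each routed message averages on the order of 40 input tokens; that per-message token count is an assumption for illustration, not a quoted number. A quick sanity check:

```python
def daily_input_cost(messages: int, avg_input_tokens: int, price_per_m_tokens: float) -> float:
    """Daily input-token spend in dollars for a routing pipeline."""
    total_tokens = messages * avg_input_tokens
    return total_tokens / 1_000_000 * price_per_m_tokens

# 1M short messages/day at ~40 input tokens each, GPT-4.1 nano at $0.10/M in
print(daily_input_cost(1_000_000, 40, 0.10))  # → 4.0
```

Output tokens cost extra, but for classification/routing the replies are usually a handful of tokens, so input dominates.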

Self-hosted / private (best open source)

DeepSeek V3.2 or Qwen 3 235B

94% HumanEval, MIT license; also the cheapest hosted API if you'd rather not self-host.

Self-hosted on single GPU

Qwen 3 30B (MoE) or Gemma 4 27B MoE

Only 3B active params, so it runs fast on an RTX 4090. Hybrid thinking mode.

Hard math / science / reasoning

o3 or DeepSeek R1

96.7% on AIME. Set reasoning_effort='high' for the hardest problems. R1 is the open-source alternative.
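A minimal sketch of what that parameter looks like in a Chat Completions-style request payload. The helper name is ours, and the model ID is illustrative; send the dict with whatever client you use:

```python
def build_reasoning_request(problem: str, effort: str = "high") -> dict:
    """Build a Chat Completions-style payload with a reasoning-effort setting.

    Higher effort means more thinking tokens, so higher latency and cost.
    """
    assert effort in ("low", "medium", "high")
    return {
        "model": "o3",  # illustrative model ID
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": problem}],
    }

req = build_reasoning_request("Prove that the square root of 2 is irrational.")
```

Reserve "high" for genuinely hard problems; on easy queries it mostly buys latency.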

Budget reasoning

Grok 3 mini or o4-mini

$0.30/M input. Reasoning at a small fraction of o3's price.

Long documents (500K+ tokens)

Gemini 2.5 Pro or Llama 4 Scout

1M context (Gemini) or 10M (Scout, open source).
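Before picking a long-context model, it helps to estimate whether your document fits at all. A common heuristic is roughly 4 characters per token for English prose (an approximation; use the model's real tokenizer for anything close to the limit):

```python
def fits_in_context(text: str, context_tokens: int, reply_budget: int = 8_000) -> bool:
    """Heuristic fit check: ~4 characters per token for English prose.

    reply_budget reserves room for the model's output inside the window.
    """
    estimated_tokens = len(text) // 4
    return estimated_tokens + reply_budget <= context_tokens

# A ~500K-token document (~2M characters) fits comfortably in a 1M window:
doc = "x" * 2_000_000
print(fits_in_context(doc, 1_000_000))  # → True
```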

Multimodal (images + video)

Gemma 4 31B Dense or Gemini 2.5 Pro

Open source, Apache 2.0, native image+video. #3 on LMSYS.

RAG / search embeddings

text-embedding-3-large or Nomic Embed v2 MoE

OpenAI for quality; Nomic for zero-cost self-hosting.
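Whichever embedding model you choose, retrieval works the same way: embed the query, embed the documents, rank by cosine similarity. A toy sketch with 3-dimensional vectors (real embeddings are hundreds to thousands of dimensions, depending on the model):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings" standing in for real model output.
query = [0.9, 0.1, 0.0]
docs = {
    "pricing page": [0.8, 0.2, 0.1],
    "unrelated doc": [0.0, 0.1, 0.9],
}
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # → pricing page
```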

Code completion in editor

Qwen 2.5 Coder 32B or Codestral 25.01

92.7% HumanEval, Apache 2.0, self-hosted Copilot replacement.

On AWS (via Bedrock)

Claude Sonnet 4.6 or Claude Opus 4.6

Both available on Bedrock. No external API keys needed.
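On Bedrock, Anthropic models take a JSON request body in the Anthropic messages format (the `anthropic_version` string and body shape are Bedrock's documented format; the model ID below is a placeholder — look up the exact ID in your Bedrock console):

```python
import json

# Placeholder model ID -- the real ID is listed in the Bedrock model catalog.
MODEL_ID = "anthropic.claude-sonnet-4-6"

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Summarize the attached diff."}],
})
# Then send it with boto3, authenticated via your AWS credentials:
#   boto3.client("bedrock-runtime").invoke_model(modelId=MODEL_ID, body=body)
```

Because auth goes through IAM, there is no separate Anthropic API key to provision or rotate.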

Price tiers (input per 1M tokens)

Free / self-host

Llama 4 Maverick

Llama 4 Scout

Llama 3.3 70B

DeepSeek R1

DeepSeek V3

DeepSeek V3.2

Qwen 3 235B

Qwen 3 30B

Qwen 2.5 Coder 32B

Mistral Large 3

Ministral 3 (3B/8B/14B)

Codestral 25.01

Gemma 4 31B Dense

Gemma 4 27B MoE

Gemma 4 E4B

Gemma 3 27B

Phi-4

< $1/M

Claude Haiku 4.5

GPT-4.1 mini

GPT-4.1 nano

Gemini 2.5 Flash

Grok 3 mini

Kimi K2.5

Moonshot v1 (8K/32K/128K)

$1–5/M

Claude Opus 4.7

Claude Sonnet 4.6

GPT-4.1

o3

o4-mini

Gemini 2.5 Pro

Grok 3

> $5/M

Claude Opus 4.6

Full comparison

All 32 models:

| Model | Vendor | License | Context | Input price | HumanEval | SWE-bench | Strengths |
|---|---|---|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | Closed | 200K | $0.80/M | 88.5% | — | speed, cost, structured output |
| Claude Opus 4.6 | Anthropic | Closed | 1M | $15/M | 95.4% | 72.5% | reasoning, long context, tool use |
| Claude Opus 4.7 | Anthropic | Closed | 1M | $5/M | — | — | agentic coding, new tokenizer, adaptive thinking (step-change over Opus 4.6) |
| Claude Sonnet 4.6 | Anthropic | Closed | 200K | $3/M | 93.8% | 65.2% | speed, cost-efficiency, coding |
| Codestral 25.01 | Mistral | Open | 256K | self-host | 91.0% | — | code completion, FIM (fill-in-the-middle), 80+ languages |
| DeepSeek R1 | DeepSeek | Open | 128K | self-host | — | 49.2% | reasoning, math, coding |
| DeepSeek V3 | DeepSeek | Open | 128K | self-host | 92.1% | — | coding, math, MoE 37B active / 671B total |
| DeepSeek V3.2 | DeepSeek | Open | 164K | self-host | 94.0% | — | coding, math, sparse attention (DSA) |
| Gemini 2.5 Flash | Google | Closed | 1M | $0.15/M | — | — | speed, cost, long context |
| Gemini 2.5 Pro | Google | Closed | 1M | $1.25/M | — | 63.8% | multimodal, long context, search grounding |
| Gemma 3 27B | Google | Open | 128K | self-host | 78.0% | — | compact, multimodal, runs on a single GPU |
| Gemma 4 27B MoE | Google | Open | 128K | self-host | — | — | MoE efficiency, multimodal, images + video |
| Gemma 4 31B Dense | Google | Open | 256K | self-host | — | — | multimodal, images + video, 35+ languages |
| Gemma 4 E4B | Google | Open | 128K | self-host | — | — | tiny, on-device, multimodal + audio |
| GPT-4.1 | OpenAI | Closed | 1M | $2/M | 95.3% | 54.6% | coding, instruction following, long context |
| GPT-4.1 mini | OpenAI | Closed | 1M | $0.40/M | 92.5% | 28.8% | cost, speed, long context |
| GPT-4.1 nano | OpenAI | Closed | 1M | $0.10/M | — | — | ultra-cheap, fast, classification |
| Grok 3 | xAI | Closed | 128K | $3/M | — | — | reasoning, real-time data, math |
| Grok 3 mini | xAI | Closed | 128K | $0.30/M | — | — | fast reasoning, very cheap, math |
| Kimi K2.5 | Moonshot AI | Closed | 256K | $0.55/M | — | — | reasoning, multimodal, cheap |
| Llama 3.3 70B | Meta | Open | 128K | self-host | 88.4% | — | mature ecosystem, fine-tuning friendly, wide hardware support |
| Llama 4 Maverick | Meta | Open | 1M | self-host | 84.8% | — | multilingual, multimodal, MoE architecture |
| Llama 4 Scout | Meta | Open | 10M | self-host | — | — | longest context (10M), MoE 17B active / 109B total, fits a single H100 |
| Ministral 3 (3B/8B/14B) | Mistral | Open | 128K | self-host | — | — | edge-friendly, vision, dense |
| Mistral Large 3 | Mistral | Open | 256K | self-host | — | — | MoE 41B active / 675B total, multilingual, function calling |
| Moonshot v1 (8K/32K/128K) | Moonshot AI | Closed | 8K / 32K / 128K | $0.14/M | — | — | very cheap, no hidden reasoning, reliable JSON |
| o3 | OpenAI | Closed | 200K | $2/M | — | 69.1% | reasoning, math, science |
| o4-mini | OpenAI | Closed | 200K | $1.10/M | — | 68.1% | reasoning, coding, cost-efficient reasoning |
| Phi-4 | Microsoft | Open | 16K | self-host | — | — | 14B params, STEM reasoning, MIT license |
| Qwen 2.5 Coder 32B | Alibaba | Open | 128K | self-host | 92.7% | — | code completion, code generation, Apache 2.0 |
| Qwen 3 235B | Alibaba | Open | 128K | self-host | 90.2% | — | hybrid thinking, MoE 22B active, Apache 2.0 |
| Qwen 3 30B | Alibaba | Open | 128K | self-host | — | — | MoE 3B active / 30B total, runs on a consumer GPU, hybrid thinking |
