AI Models
52 models · 19 new in 60d
- ▾Claude Opus 4.7 · New
Anthropic · 1M tokens · $5/M → $25/M
Best for: Complex multi-step coding, long agentic workflows, 1M-token codebase reads. The most capable generally available model.
How: client.messages.create(model='claude-opus-4-7', ...). Adaptive thinking is on by default — no separate extended-thinking mode needed.
Example: Use Claude Code CLI with --model claude-opus-4-7 to handle PR-sized refactors end-to-end in a single run.
SWE-bench step-change over Opus 4.6 · Context 1M (~555k words) · agentic coding · new tokenizer · adaptive thinking · 1M context · 128k max output
API: api.anthropic.com (model: claude-opus-4-7) · AWS Bedrock · GCP Vertex AI · Microsoft Foundry
Step-change improvement in agentic coding vs Opus 4.6. New tokenizer means 1M tokens ≈ 555k words (vs 750k for Sonnet 4.6).
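A minimal API sketch, assuming the anthropic Python SDK and an ANTHROPIC_API_KEY in the environment:
import anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY
msg = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Refactor this module for clarity: ..."}],
)
print(msg.content[0].text)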
- ▾Gemma 4 31B Dense · New · Open
Google · 256K tokens · self-host
Best for: Self-hosted multimodal production, commercial use, multilingual apps
How: Dense 31B — fits on a single A100 or 2x RTX 4090. Apache 2.0 = fully commercial. Supports images and video natively.
Example: Deploy as a private multimodal assistant that reads screenshots, logs, and video clips.
LMSYS Arena #3 (text) · MMLU ~82% · multimodal · images + video · 35+ languages · Apache 2.0 · dense architecture
Hardware to self-host: VRAM 20GB (quantized) / 62GB (FP16) · GPU 1× A100 80GB or 2× RTX 4090 24GB · RAM 32GB+ system RAM
31B dense. Native multimodal (images + video) increases compute cost vs text-only.
API: Ollama, vLLM, Hugging Face, Vertex AI. ollama run gemma4:31b
Brand new (Apr 2026). Ranked #3 on LMSYS Arena text leaderboard at launch.
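A local sketch via the ollama Python client, assuming Ollama is running with the gemma4:31b tag pulled:
import ollama

resp = ollama.chat(
    model="gemma4:31b",
    messages=[{
        "role": "user",
        "content": "Summarize what this screenshot shows.",
        "images": ["screenshot.png"],  # local image path; the model is natively multimodal
    }],
)
print(resp["message"]["content"])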
- ▾Gemma 4 27B MoE · New · Open
Google · 128K tokens · self-host
Best for: Faster self-hosted inference, cost-efficient multimodal
How: MoE variant — faster inference than the 31B dense. Same multimodal capabilities.
Example: Process image-based monitoring alerts faster than the dense variant at the same quality.
LMSYS Arena #6 (text) · MoE efficiency · multimodal · images + video · Apache 2.0
Hardware to self-host: VRAM 18GB (quantized) / 54GB (FP16) · GPU RTX 4090 24GB or 1× A100 40GB · RAM 32GB+ system RAM
27B total MoE — faster inference than the 31B dense thanks to sparse activations.
API: Ollama, vLLM, Hugging Face. ollama run gemma4:27b-moe
- ▾Gemma 4 E4B · New · Open
Google · 128K tokens · self-host
Best for: Edge, mobile, IoT, on-device AI with multimodal input
How: 4B params — runs on almost any hardware. Supports images, video, and native audio input.
Example: Run on a Raspberry Pi to process security camera feeds with voice commands.
tiny · on-device · multimodal + audio · Apache 2.0
Hardware to self-host: VRAM 3GB (quantized) / 8GB (FP16) · GPU any (CPU, phone, Jetson, Raspberry Pi 5, integrated GPU) · RAM 4-8GB system RAM
4B params. Edge-first design: runs on phones, SBCs, IoT devices.
API: Ollama, Hugging Face. Runs on phones and Raspberry Pi.
- ▾ERNIE Image Turbo GGUF · New · Open
unsloth · self-host
Best for: Trending on HuggingFace (151 likes this week)
How: Available on Hugging Face. 14K downloads.
Example: from huggingface_hub import hf_hub_download; path = hf_hub_download("unsloth/ERNIE-Image-Turbo-GGUF", "ERNIE-Image-Turbo-Q4_K_M.gguf")  # GGUF quant of a text-to-image model, meant for GGML-based runtimes rather than transformers; filename is illustrative, check the repo's file list
ggml · gguf · text-to-image · unsloth · base_model: baidu/ERNIE-Image-Turbo
API: huggingface.co/unsloth/ERNIE-Image-Turbo-GGUF
Auto-discovered from HuggingFace trending. 151 likes, 14K downloads.
- ▾Gemma 4 31B It NVFP4 Turbo · New · Open
LilaRest · self-host
Best for: Trending on HuggingFace (246 likes this week)
How: Available on Hugging Face. 74K downloads.
Example: from vllm import LLM; llm = LLM("LilaRest/gemma-4-31B-it-NVFP4-turbo")  # NVFP4 checkpoints generally need an FP4-aware runtime such as vLLM or TensorRT-LLM, not plain AutoModelForCausalLM
transformers · safetensors · gemma4 · text-generation · gemma-4-31b-it
API: huggingface.co/LilaRest/gemma-4-31B-it-NVFP4-turbo
Auto-discovered from HuggingFace trending. 246 likes, 74K downloads.
- ▾Supergemma4 26b Uncensored Mlx 4bit V2 · New · Open
Jiunsong · self-host
Best for: Trending on HuggingFace (170 likes this week)
How: Available on Hugging Face. 12K downloads.
Example: from mlx_lm import load, generate; model, tokenizer = load("Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2")  # MLX build, runs on Apple Silicon via mlx-lm rather than transformers
mlx · safetensors · gemma4 · uncensored · apple-silicon
API: huggingface.co/Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2
Auto-discovered from HuggingFace trending. 170 likes, 12K downloads.
- ▾Gemma 4 E4B It OBLITERATED · New · Open
OBLITERATUS · self-host
Best for: Trending on HuggingFace (276 likes this week)
How: Available on Hugging Face.
Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("OBLITERATUS/gemma-4-E4B-it-OBLITERATED")
safetensors · gguf · gemma4 · abliterated · uncensored
API: huggingface.co/OBLITERATUS/gemma-4-E4B-it-OBLITERATED
Auto-discovered from HuggingFace trending. 276 likes, 7K downloads.
- ▾Gemma 4 31B JANG_4M CRACK · New · Open
dealignai · self-host
Best for: Trending on HuggingFace (1258 likes this week)
How: Available on Hugging Face. 153K downloads.
Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("dealignai/Gemma-4-31B-JANG_4M-CRACK")
mlx · safetensors · gemma4 · abliterated · uncensored
API: huggingface.co/dealignai/Gemma-4-31B-JANG_4M-CRACK
Auto-discovered from HuggingFace trending. 1258 likes, 153K downloads.
- ▾ERNIE Image Turbo · New · Open
baidu · self-host
Best for: Trending on HuggingFace (290 likes this week)
How: Available on Hugging Face.
Example: from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained("baidu/ERNIE-Image-Turbo"); image = pipe("a lighthouse at dawn").images[0]  # text-to-image, so diffusers rather than AutoModelForCausalLM
diffusers · safetensors · text-to-image · 8B · license: apache-2.0
API: huggingface.co/baidu/ERNIE-Image-Turbo
Auto-discovered from HuggingFace trending. 290 likes, 3K downloads.
- ▾Qwen3.6 35B A3B GGUF · New · Open
unsloth · self-host
Best for: Trending on HuggingFace (367 likes this week)
How: Available on Hugging Face. 153K downloads.
Example: from llama_cpp import Llama; llm = Llama.from_pretrained(repo_id="unsloth/Qwen3.6-35B-A3B-GGUF", filename="*Q4_K_M.gguf")  # GGUF quant, load via llama-cpp-python or Ollama rather than AutoModelForCausalLM; quant filename pattern is illustrative
transformers · gguf · unsloth · qwen · qwen3_5_moe
API: huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
Auto-discovered from HuggingFace trending. 367 likes, 153K downloads.
- ▾Supergemma4 26b Uncensored Gguf V2 · New · Open
Jiunsong · self-host
Best for: Trending on HuggingFace (381 likes this week)
How: Available on Hugging Face. 54K downloads.
Example: from llama_cpp import Llama; llm = Llama.from_pretrained(repo_id="Jiunsong/supergemma4-26b-uncensored-gguf-v2", filename="*Q4_K_M.gguf")  # GGUF for llama.cpp-family runtimes; quant filename pattern is illustrative
gguf · gemma4 · uncensored · fast · llama.cpp
API: huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2
Auto-discovered from HuggingFace trending. 381 likes, 54K downloads.
- ▾GLM 5.1 · New · Open
zai-org · self-host
Best for: Trending on HuggingFace (1383 likes this week)
How: Available on Hugging Face. 100K downloads.
Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-5.1")
transformers · safetensors · glm_moe_dsa · text-generation · conversational
API: huggingface.co/zai-org/GLM-5.1
Auto-discovered from HuggingFace trending. 1383 likes, 100K downloads.
- ▾ERNIE Image · New · Open
baidu · self-host
Best for: Trending on HuggingFace (425 likes this week)
How: Available on Hugging Face.
Example: from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained("baidu/ERNIE-Image"); image = pipe("a lighthouse at dawn").images[0]  # text-to-image, so diffusers rather than AutoModelForCausalLM
diffusers · safetensors · text-to-image · 8B · license: apache-2.0
API: huggingface.co/baidu/ERNIE-Image
Auto-discovered from HuggingFace trending. 425 likes, 2K downloads.
- ▾HY Embodied 0.5 · New · Open
tencent · self-host
Best for: Trending on HuggingFace (852 likes this week)
How: Available on Hugging Face.
Example: from transformers import AutoModelForImageTextToText; model = AutoModelForImageTextToText.from_pretrained("tencent/HY-Embodied-0.5", trust_remote_code=True)  # image-text-to-text checkpoint; the custom hunyuan_vl_mot architecture likely needs trust_remote_code
transformers · safetensors · hunyuan_vl_mot · image-text-to-text · hunyuan
API: huggingface.co/tencent/HY-Embodied-0.5
Auto-discovered from HuggingFace trending. 852 likes, 1K downloads.
- ▾Qwen3.6 35B A3B · New · Open
Qwen · self-host
Best for: Trending on HuggingFace (736 likes this week)
How: Available on Hugging Face. 21K downloads.
Example: from transformers import AutoModelForImageTextToText; model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.6-35B-A3B")  # image-text-to-text checkpoint, per the tags
transformers · safetensors · qwen3_5_moe · image-text-to-text · conversational
API: huggingface.co/Qwen/Qwen3.6-35B-A3B
Auto-discovered from HuggingFace trending. 736 likes, 21K downloads.
- ▾MiniMax M2.7 · New · Open
MiniMaxAI · self-host
Best for: Trending on HuggingFace (925 likes this week)
How: Available on Hugging Face. 189K downloads.
Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("MiniMaxAI/MiniMax-M2.7")
transformers · safetensors · minimax_m2 · text-generation · conversational
API: huggingface.co/MiniMaxAI/MiniMax-M2.7
Auto-discovered from HuggingFace trending. 925 likes, 189K downloads.
- ▾Nucleus Image · New · Open
NucleusAI · self-host
Best for: Trending on HuggingFace (159 likes this week)
How: Available on Hugging Face.
Example: from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained("NucleusAI/Nucleus-Image"); image = pipe("a lighthouse at dawn").images[0]  # diffusion text-to-image, so diffusers rather than AutoModelForCausalLM
diffusers · safetensors · moe · sparse-moe · diffusion
API: huggingface.co/NucleusAI/Nucleus-Image
Auto-discovered from HuggingFace trending. 159 likes, 802 downloads.
- ▾Gemma 4 31B It · New · Open
google · self-host
Best for: Trending on HuggingFace (2122 likes this week)
How: Available on Hugging Face. 3.5M downloads.
Example: from transformers import AutoModelForImageTextToText; model = AutoModelForImageTextToText.from_pretrained("google/gemma-4-31B-it")  # image-text-to-text checkpoint, per the tags
transformers · safetensors · gemma4 · image-text-to-text · conversational
API: huggingface.co/google/gemma-4-31B-it
Auto-discovered from HuggingFace trending. 2122 likes, 3.5M downloads.
- ▾DeepSeek V3.2 · Open
DeepSeek · 164K tokens · self-host
Best for: Long-context coding, upgraded V3 deployments
How: Drop-in upgrade from V3. Uses Dynamic Sparse Attention for better long-context performance.
Example: Feed your entire microservice codebase and get cross-service dependency analysis.
HumanEval 94.0% · coding · math · sparse attention (DSA) · MIT license · improved context
Hardware to self-host: VRAM 350GB (quantized) · GPU 8× H100 80GB · RAM 512GB+ system RAM
Same hardware footprint as V3 — 671B with sparse attention.
API: api.deepseek.com OR self-host via vLLM. Same OpenAI-compatible API.
- ▾Mistral Large 3 · Open
Mistral · 256K tokens · self-host
Best for: European deployments, agent workflows, long-context multilingual apps
How: Major upgrade from Large 2. MoE architecture with 41B active params. Same API, just change model ID.
Example: Build a multi-tool agent that queries DBs, calls APIs, and generates reports in 30+ languages.
MoE 41B active / 675B total · multilingual · function calling · 256K context
Hardware to self-host: VRAM 350GB (quantized) · GPU 8× H100 80GB · RAM 512GB+ system RAM
675B MoE (41B active). Datacenter class — most users go via api.mistral.ai.
API: api.mistral.ai OR self-host via vLLM. OpenAI-compatible.
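A function-calling sketch with the mistralai Python SDK; the tool schema and the query_db tool are hypothetical:
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
tools = [{
    "type": "function",
    "function": {
        "name": "query_db",  # hypothetical tool
        "description": "Run a read-only SQL query",
        "parameters": {"type": "object", "properties": {"sql": {"type": "string"}}, "required": ["sql"]},
    },
}]
resp = client.chat.complete(
    model="mistral-large-latest",
    messages=[{"role": "user", "content": "How many signups did we get last week?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)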
- ▾Ministral 3 (3B/8B/14B) · Open
Mistral · 128K tokens · self-host
Best for: Edge deployment, on-device AI, lightweight vision tasks
How: 3B fits on phones, 8B on laptops, 14B on dev GPUs. All have vision support.
Example: Run 8B on a Jetson to classify manufacturing defects from camera feeds.
edge-friendly · vision · dense · 3 sizes
Hardware to self-host: VRAM 2GB (3B) / 6GB (8B) / 10GB (14B quantized) · GPU phone/CPU (3B), laptop GPU (8B), RTX 3060+ (14B) · RAM 8-16GB system RAM
All three sizes are dense with vision. 3B runs on phones, 8B on laptops, 14B on dev GPUs.
API: Ollama, vLLM, Hugging Face. Also on Mistral API.
- ▾Kimi K2.5
Moonshot AI · 256K tokens · $0.55/M → $2.19/M
Best for: Budget alternative to flagship models, Chinese language tasks
How: OpenAI SDK with base_url='https://api.moonshot.ai/v1'. WARNING: has implicit reasoning that eats max_tokens.
Example: Use moonshot-v1-8k instead for structured JSON tasks — kimi-k2.5 wastes tokens on hidden thinking.
reasoning · multimodal · cheap
API: api.moonshot.ai — OpenAI-compatible
Watch: hidden thinking burns tokens · temperature locked to 1
- ▾Claude Opus 4.6
Anthropic · 1M tokens · $15/M → $75/M
Best for: Complex multi-step coding, large codebase refactors, long-document analysis
How: Best via Claude Code CLI for coding tasks. For API: messages.create() with system prompt + tools.
Example: claude-code: point it at a repo, describe the feature, it reads/edits/tests autonomously.
SWE-bench 72.5% · GPQA Diamond 74.9% · HumanEval 95.4% · reasoning · long context · tool use · agentic workflows · code generation
API: api.anthropic.com — SDK: pip install anthropic / npm i @anthropic-ai/sdk
- ▾Claude Sonnet 4.6
Anthropic · 200K tokens · $3/M → $15/M
Best for: Production API backends, real-time chat, moderate complexity coding
How: Drop-in replacement for Opus when you need faster/cheaper. Same API, just change model ID.
Example: Use as the default model in your API gateway — upgrade to Opus only for hard problems.
SWE-bench 65.2% · HumanEval 93.8% · speed · cost-efficiency · coding · tool use
API: api.anthropic.com — same SDK as Opus
- ▾Gemini 2.5 Flash
Google · 1M tokens · $0.15/M → $0.60/M
Best for: High-volume processing, real-time apps, budget-conscious pipelines
How: Set thinking_budget to control reasoning cost. 0 = no thinking, 24576 = max.
Example: Summarize 1000 GitHub issues per hour for a triage dashboard at ~$1.
speed · cost · long context · thinking budget control
API: Same SDK as Gemini Pro. model='gemini-2.5-flash-preview-05-20'
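A sketch of the thinking-budget control with the google-genai SDK, assuming GEMINI_API_KEY is set:
from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY
resp = client.models.generate_content(
    model="gemini-2.5-flash-preview-05-20",
    contents="Summarize these issue titles: ...",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0),  # 0 = no thinking; up to 24576
    ),
)
print(resp.text)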
- ▾Claude Haiku 4.5
Anthropic · 200K tokens · $0.80/M → $4/M
Best for: Pipelines, batch processing, structured data extraction, routing
How: Use for high-volume, low-complexity tasks: classification, extraction, summarization.
Example: Process 10K support tickets per hour to classify priority and extract entities.
HumanEval 88.5% · speed · cost · structured output · classification
API: api.anthropic.com — same SDK
- ▾GPT-4.1
OpenAI · 1M tokens · $2/M → $8/M
Best for: General-purpose API integration, multimodal apps, coding assistance
How: client.chat.completions.create(model='gpt-4.1', messages=[...]). Supports vision, tools, JSON mode.
Example: Build a PR review bot that reads diffs + screenshots and posts comments.
SWE-bench 54.6% · HumanEval 95.3% · coding · instruction following · long context · multimodal
API: api.openai.com — SDK: pip install openai / npm i openai
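A JSON-mode sketch with the openai Python SDK; the schema in the system prompt is illustrative:
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY
resp = client.chat.completions.create(
    model="gpt-4.1",
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        {"role": "system", "content": "Return JSON with keys 'severity' and 'summary'."},
        {"role": "user", "content": "Review this diff: ..."},
    ],
)
print(resp.choices[0].message.content)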
- ▾GPT-4.1 mini
OpenAI · 1M tokens · $0.40/M → $1.60/M
Best for: Embeddings preprocessing, log parsing, lightweight generation
How: Same API as GPT-4.1. Best for high-volume, simple tasks where cost matters.
Example: Parse 50K structured logs per hour and extract error patterns.
SWE-bench 28.8% · HumanEval 92.5% · cost · speed · long context
API: api.openai.com — same SDK
- ▾GPT-4.1 nano
OpenAI · 1M tokens · $0.10/M → $0.40/M
Best for: Intent classification, entity extraction at massive scale
How: Use for routing, tagging, simple extraction where quality bar is lower.
Example: Route 1M incoming messages per day to the right service for $4 total.
ultra-cheap · fast · classification
API: api.openai.com — same SDK
- ▾o3
OpenAI · 200K tokens · $2/M → $8/M
Best for: Hard math, science, multi-step planning, complex debugging
How: Use reasoning_effort param: 'low'/'medium'/'high'. No system prompt — use developer message instead.
Example: Debug a distributed system deadlock by feeding it the full trace + architecture.
GPQA Diamond 79.7% · AIME 2024 96.7% · SWE-bench 69.1% · reasoning · math · science · planning
API: api.openai.com — same SDK, just model='o3'
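A sketch of the reasoning controls, assuming the openai Python SDK:
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="o3",
    reasoning_effort="high",  # 'low' / 'medium' / 'high'
    messages=[
        {"role": "developer", "content": "You debug distributed systems."},  # developer, not system
        {"role": "user", "content": "Find the deadlock in this trace: ..."},
    ],
)
print(resp.choices[0].message.content)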
- ▾o4-mini
OpenAI · 200K tokens · $1.10/M → $4.40/M
Best for: Coding with reasoning, moderate-complexity math, budget reasoning
How: Cheaper reasoning model. Use when o3 is overkill but you need chain-of-thought.
Example: Generate a migration plan for a database schema change with safety checks.
AIME 2024 93.4% · SWE-bench 68.1% · reasoning · coding · cost-efficient reasoning
API: api.openai.com — same SDK
- ▾Llama 4 Maverick · Open
Meta · 1M tokens · self-host
Best for: Self-hosted production deployments, privacy-sensitive workloads
How: ollama run llama4-maverick OR deploy on vLLM with tensor parallelism. Also available hosted on Together/Groq.
Example: Deploy on 2x A100 GPUs behind your API gateway for private code review.
MMLU 88.4% · HumanEval 84.8% · multilingual · multimodal · MoE architecture · 17B active / 400B total
Hardware to self-host: VRAM 200GB (quantized) · GPU 2× H100 80GB or 4× A100 80GB · RAM 256GB system RAM
400B total params (17B active). FP16 needs ~800GB, FP8 ~400GB, INT4 ~200GB.
API: Self-host via vLLM, Ollama, or use via Together, Fireworks, Groq
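A vLLM deployment sketch; the Hugging Face repo ID is illustrative, check Meta's model card for the exact name:
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Maverick",  # illustrative repo ID
    tensor_parallel_size=2,  # split across 2 GPUs, per the hardware notes above
)
out = llm.generate(["Review this function: ..."], SamplingParams(max_tokens=512))
print(out[0].outputs[0].text)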
- ▾Llama 4 Scout · Open
Meta · 10M tokens · self-host
Best for: Processing entire codebases, very long documents, single-GPU deployments
How: Fits on a single H100. Best open model for extreme context lengths.
Example: Feed your entire monorepo into context and ask about cross-service dependencies.
MMLU 86.2% · longest context (10M) · MoE 17B active / 109B total · fits single H100
Hardware to self-host: VRAM 80GB · GPU 1× H100 80GB · RAM 128GB system RAM
17B active params, fits in a single H100 at FP8.
API: Same as Maverick — vLLM, Ollama, Together, Fireworks
- ▾Qwen 3 235B · Open
Alibaba · 128K tokens · self-host
Best for: Flexible thinking control, commercial self-hosting, multilingual
How: Supports /think and /no_think tags to toggle reasoning on/off per request. Apache 2.0 = fully commercial.
Example: Use /no_think for fast classification, /think for complex debugging — same model.
AIME 2024 85.7% · HumanEval 90.2% · hybrid thinking · MoE 22B active · Apache 2.0 · multilingual
Hardware to self-host: VRAM 140GB (quantized) · GPU 4× A100 80GB or 2× H100 · RAM 256GB+ system RAM
235B total (22B active). MoE architecture — only 22B params active per forward pass.
API: Self-host via vLLM/SGLang or use via Together, Fireworks. Also on Alibaba Cloud.
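A sketch of the per-request thinking toggle over an OpenAI-compatible endpoint; base URL and model ID depend on your host:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # e.g. a local vLLM server
resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B",  # illustrative model ID
    messages=[{"role": "user", "content": "Classify this log line as INFO/WARN/ERROR. /no_think"}],
)
print(resp.choices[0].message.content)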
- ▾Qwen 3 30B · Open
Alibaba · 128K tokens · self-host
Best for: Local development, laptop-friendly reasoning, privacy
How: Excellent for local dev. MoE means only 3B params active — fast on consumer hardware.
Example: Run on your dev machine as a private coding assistant with reasoning.
AIME 2024 66.7% · MoE 3B active / 30B total · runs on consumer GPU · hybrid thinking
Hardware to self-host: VRAM 20GB (quantized) / 60GB (FP16) · GPU RTX 4090 24GB (quantized) or 1× A100 · RAM 32GB+ system RAM
30B total (3B active). The 3B active params make inference fast on consumer hardware.
API: ollama run qwen3:30b — fits on RTX 4090 (24GB)
- ▾GPT-Image-1
OpenAI · N/A · $5/M tokens → $40/M tokens
Best for: UI mockups, marketing assets, diagrams with text
How: Supports text overlays, inpainting, and style control. Best text rendering of any model.
Example: Generate architecture diagrams with accurate labels from a text description.
text rendering · instruction following · editing
API: api.openai.com — client.images.generate(model='gpt-image-1')
- ▾Gemini 2.5 Pro
Google · 1M tokens · $1.25/M → $10/M
Best for: Long-document analysis, multimodal tasks, apps needing search grounding
How: client.models.generate_content(model='gemini-2.5-pro', contents=[...]). Supports grounding with Google Search.
Example: Feed a 200-page architecture doc and ask it to find security issues.
SWE-bench 63.8% · GPQA Diamond 67.2% · multimodal · long context · search grounding · code generation
API: generativelanguage.googleapis.com — SDK: pip install google-genai
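A search-grounding sketch with the google-genai SDK:
from google import genai
from google.genai import types

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="What changed in the most recent Kubernetes release?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],  # ground answers in Google Search
    ),
)
print(resp.text)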
- ▾Gemma 3 27B · Open
Google · 128K tokens · self-host
Best for: On-device/edge deployment, multimodal at small scale
How: ollama run gemma3:27b. Fits on RTX 3090/4090. Good multimodal + tool use at small size.
Example: Run on a dev server to process screenshots and generate bug reports.
MMLU 75.6% · HumanEval 78.0% · compact · multimodal · runs on single GPU · function calling
Hardware to self-host: VRAM 18GB (quantized) / 54GB (FP16) · GPU RTX 3090/4090 24GB or 1× A100 40GB · RAM 32GB+ system RAM
27B dense. Fits on a single high-end consumer GPU with quantization.
API: Ollama, vLLM, Hugging Face. Also on Vertex AI.
- ▾Grok 3
xAI · 128K tokens · $3/M → $15/M
Best for: Tasks needing real-time information, math-heavy problems
How: OpenAI SDK with base_url override. Also supports live search via tools.
Example: Monitor real-time tech news and generate summaries using live search.
GPQA Diamond 68.2% · AIME 2024 93.3% · reasoning · real-time data · math
API: api.x.ai — OpenAI-compatible SDK. Set base_url='https://api.x.ai/v1'
- ▾Grok 3 mini
xAI · 128K tokens · $0.30/M → $0.50/M
Best for: Budget reasoning tasks, math, lightweight chain-of-thought
How: Excellent cost-to-reasoning ratio. Use reasoning_effort param.
Example: Validate Terraform plans with reasoning about dependency chains for pennies.
fast reasoning · very cheap · math
API: api.x.ai — same as Grok 3
- ▾Nomic Embed Text v2-MoE · Open
Nomic AI · 8K tokens · self-host
Best for: Self-hosted RAG, privacy-first search, zero-cost embeddings
How: Self-host for zero cost. Comparable quality to OpenAI embeddings.
Example: Run alongside pgvector on the same server — full RAG pipeline with zero API costs.
MoE embedding · matryoshka · Apache 2.0 · self-hostable
Hardware to self-host: VRAM 2GB or CPU-only · GPU any (runs on CPU at reasonable speed) · RAM 4-8GB system RAM
Tiny MoE embedding model. CPU inference is fast enough for most use cases.
API: pip install nomic OR Ollama. Also hosted on Nomic Atlas.
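One common way to run it locally is via sentence-transformers; the repo ID and task prefixes are per Nomic's model card and shown here as assumptions:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
docs = model.encode(["search_document: pgvector setup guide ..."])   # document-side prefix
query = model.encode(["search_query: how do I set up pgvector?"])    # query-side prefix
print(docs.shape)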
- ▾DeepSeek R1 · Open
DeepSeek · 128K tokens · self-host
Best for: Budget reasoning, self-hosted chain-of-thought, research
How: API is OpenAI-compatible. Self-host the 70B distill on 2x A100. MIT license = no restrictions.
Example: Run the 14B distill locally for debugging complex distributed system issues.
AIME 2024 79.8% · SWE-bench 49.2% · GPQA Diamond 71.5% · reasoning · math · coding · MIT license · distillable
Hardware to self-host: VRAM 10GB (14B distill) / 48GB (70B distill) / 1TB+ (full 671B) · GPU RTX 4090 (14B), 2× A100 (70B), 8× H100 (full) · RAM full model needs 256GB+ system RAM
Full 671B MoE is massive. Distilled versions (14B, 32B, 70B) are far more practical.
API: api.deepseek.com ($0.55/M in, $2.19/M out) OR self-host via vLLM/Ollama
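A hosted-API sketch with the openai SDK; deepseek-reasoner is the hosted model ID for R1:
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key=os.environ["DEEPSEEK_API_KEY"])
resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Why does this code deadlock? ..."}],
)
print(resp.choices[0].message.content)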
- ▾Codestral 25.01 · Open
Mistral · 256K tokens · self-host
Best for: Code completion, inline suggestions, editor integration
How: Supports FIM for inline completion. Integrate with any editor via LSP or Continue.dev.
Example: Deploy as your team's FIM-capable completion server behind an LSP proxy.
HumanEval 91.0% · code completion · FIM (fill-in-middle) · 80+ languages
Hardware to self-host: VRAM 16GB (quantized) / 45GB (FP16) · GPU RTX 4090 24GB or 1× A100 40GB · RAM 32GB+ system RAM
22B dense. Fits on a single consumer GPU with quantization.
API: codestral.mistral.ai — dedicated code endpoint
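A FIM sketch with the mistralai Python SDK, where the model completes the middle between a prompt and a suffix:
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
resp = client.fim.complete(
    model="codestral-latest",
    prompt="def binary_search(arr, target):\n    lo, hi = 0, len(arr) - 1\n",
    suffix="    return -1\n",  # the model fills in the loop between prompt and suffix
)
print(resp.choices[0].message.content)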
- ▾Llama 3.3 70B · Open
Meta · 128K tokens · self-host
Best for: Proven workhorse for self-hosted deployments, fine-tuning base
How: ollama run llama3.3:70b. For production: vLLM on 2x A100 or 4x A10G.
Example: Fine-tune on your internal docs for a private knowledge base chatbot.
MMLU 86.0% · HumanEval 88.4% · mature ecosystem · fine-tuning friendly · wide hardware support
Hardware to self-host: VRAM 40GB (4-bit) / 140GB (FP16) · GPU 2× A100 80GB or 4× A10G 24GB · RAM 64GB+ system RAM
70B dense. Widely supported — runs on Ollama with quantization on 48GB VRAM.
API: Ollama, vLLM, TGI, or hosted (Together $0.60/M, Groq, Fireworks)
- ▾DeepSeek V3 · Open
DeepSeek · 128K tokens · self-host
Best for: Cost-sensitive production APIs, coding tasks, math-heavy pipelines
How: Cheapest top-tier API. OpenAI-compatible. Self-host needs 8x A100.
Example: Replace GPT-4 in your CI pipeline for automated code review at 1/10th the cost.
HumanEval 92.1% · MMLU 88.5% · coding · math · MoE 37B active / 671B total · MIT license
Hardware to self-host: VRAM 350GB (quantized) / 1.3TB (FP16) · GPU 8× H100 80GB or 8× A100 80GB · RAM 512GB+ system RAM
671B total (37B active). Most users rent via API — self-hosting needs datacenter hardware.
API: api.deepseek.com ($0.27/M in, $1.10/M out) OR self-host
- ▾Phi-4 · Open
Microsoft · 16K tokens · self-host
Best for: Edge deployment, STEM tasks, embedded AI in products
How: ollama run phi4. MIT license — embed in commercial products freely.
Example: Embed in a CI pipeline to validate config files and Terraform plans.
GPQA Diamond 56.2% · MATH 80.4% · 14B params · STEM reasoning · MIT license · runs on laptop
Hardware to self-host: VRAM 9GB (quantized) / 28GB (FP16) · GPU any 8GB+ GPU (RTX 3060, laptop 4050, etc.) · RAM 16GB system RAM
14B dense. Runs locally on most developer laptops with quantization.
API: Ollama, Hugging Face, Azure AI
- ▾Qwen 2.5 Coder 32B · Open
Alibaba · 128K tokens · self-host
Best for: Private code completion, self-hosted Copilot replacement
How: ollama run qwen2.5-coder:32b. Plug into Continue.dev or Copilot alternatives.
Example: Set up as your team's private code completion backend — zero data leaves your infra.
HumanEval 92.7% · LiveCodeBench 48.5% · code completion · code generation · Apache 2.0
Hardware to self-host: VRAM 20GB (quantized) / 64GB (FP16) · GPU RTX 4090 24GB or 1× A100 40GB · RAM 32GB+ system RAM
32B dense. Fits on a single consumer GPU with 4-bit quantization.
API: Ollama, vLLM, or hosted on Together/Fireworks
- ▾Flux.1 Pro
Black Forest Labs · N/A · $0.05/image → N/A
Best for: High-quality image generation, product photography
How: API or self-host Flux.1 Schnell (open). Pro via API only.
Example: Generate product mockups for landing pages programmatically.
photorealism · prompt adherence · commercial license
API: api.bfl.ml OR via Replicate, fal.ai
- ▾Moonshot v1 (8K/32K/128K)
Moonshot AI · 8K / 32K / 128K tokens · $0.14/M → $0.28/M
Best for: Batch processing, structured extraction, JSON pipelines
How: Best for structured output tasks. Supports response_format: json_object. No reasoning overhead.
Example: Process RSS feeds into structured summaries for pennies per 1000 articles.
very cheap · no hidden reasoning · reliable JSON
API: api.moonshot.ai — OpenAI-compatible. model='moonshot-v1-8k'
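A structured-output sketch, assuming the openai Python SDK pointed at Moonshot's endpoint:
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key=os.environ["MOONSHOT_API_KEY"])
resp = client.chat.completions.create(
    model="moonshot-v1-8k",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Return JSON with keys 'title' and 'summary'."},
        {"role": "user", "content": "<article text>"},
    ],
)
print(resp.choices[0].message.content)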
- ▾text-embedding-3-large
OpenAI · 8K tokens · $0.13/M → N/A
Best for: RAG pipelines, semantic search, document retrieval
How: Set dimensions param to reduce size (e.g., 256 for fast search, 3072 for max quality).
Example: Index your internal docs and build a search API with pgvector + this model.
3072 dimensions · strong retrieval · matryoshka support
API: api.openai.com — client.embeddings.create(model='text-embedding-3-large')
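A sketch of the dimensions trade-off with the openai SDK:
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="How do I rotate access keys?",
    dimensions=256,  # truncated matryoshka embedding for faster search; default is 3072
)
print(len(resp.data[0].embedding))  # 256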
- ▾GPT-Rosalind
OpenAI · N/A · api
Best for: life sciences research
How: N/A
Example: N/A
accelerate drug discovery · genomics analysis · protein reasoning · scientific research workflows
Auto-discovered from news articles.