AI Models
52 models · 8 new in 60d
- ▾Claude Opus 4.7 · New
Anthropic · 1M tokens · $5/M → $25/M
Best for: Complex multi-step coding, long agentic workflows, 1M-token codebase reads. The most capable generally available model.
How: client.messages.create(model='claude-opus-4-7', ...). Adaptive thinking is on by default — no separate extended-thinking mode needed.
Example: Use Claude Code CLI with --model claude-opus-4-7 to handle PR-sized refactors end-to-end in a single run.
SWE-bench step-change over Opus 4.6 · agentic coding · new tokenizer · adaptive thinking · 1M context (~555k words) · 128k max output
API: api.anthropic.com (model: claude-opus-4-7) · AWS Bedrock · GCP Vertex AI · Microsoft Foundry
Step-change improvement in agentic coding vs Opus 4.6. New tokenizer means 1M tokens ≈ 555k words (vs 750k for Sonnet 4.6).
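A minimal sketch of the `messages.create` call described above, assuming the `anthropic` Python SDK and an `ANTHROPIC_API_KEY` in the environment; the prompt content is illustrative.

```python
import os

# Request parameters for the Messages API call named above.
# The model ID comes from this listing; max output is 128k tokens.
params = {
    "model": "claude-opus-4-7",
    "max_tokens": 4096,
    "messages": [
        {"role": "user", "content": "Refactor this module and explain the changes: ..."}
    ],
}

# Guarded so the sketch is import-safe without credentials or the SDK installed.
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    response = client.messages.create(**params)
    print(response.content[0].text)
```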
- ▾Gemma 4 31B Dense · New · Open
Google · 256K tokens · self-host
Best for: Self-hosted multimodal production, commercial use, multilingual apps
How: Dense 31B — fits on a single A100 or 2x RTX 4090. Apache 2.0 = fully commercial. Supports images and video natively.
Example: Deploy as a private multimodal assistant that reads screenshots, logs, and video clips.
LMSYS Arena #3 (text) · MMLU ~82% · multimodal (images + video) · 35+ languages · Apache 2.0 · dense architecture
Hardware to self-host: VRAM 20GB (quantized) / 62GB (FP16) · GPU: 1× A100 80GB or 2× RTX 4090 24GB · RAM: 32GB+ system RAM
31B dense. Native multimodal (images + video) increases compute cost vs text-only.
API: Ollama, vLLM, Hugging Face, Vertex AI. ollama run gemma4:31b
Brand new (Apr 2026). Ranked #3 on LMSYS Arena text leaderboard at launch.
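The `ollama run gemma4:31b` command above has an HTTP counterpart; this sketch calls Ollama's local `/api/generate` endpoint using only the standard library. The model tag follows the listing; the prompt is illustrative.

```python
import json
import urllib.request

# Payload for Ollama's generate endpoint; for multimodal prompts,
# base64-encoded images go in an additional "images" list.
payload = {
    "model": "gemma4:31b",
    "prompt": "Summarize the error shown in this log excerpt: ...",
    "stream": False,
}

def generate(payload, host="http://localhost:11434"):
    """POST to a locally running Ollama server and return the text reply."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate(payload)  # requires `ollama serve` running with the model pulled
```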
- ▾Gemma 4 31B It NVFP4 Turbo · New · Open
LilaRest · self-host
Best for: NVFP4-quantized Gemma 4 31B inference on NVIDIA GPUs (trending on HuggingFace: 246 likes this week)
How: Available on Hugging Face. 74K downloads.
Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("LilaRest/gemma-4-31B-it-NVFP4-turbo")
transformers · safetensors · gemma4 · text-generation · gemma-4-31b-it
API: huggingface.co/LilaRest/gemma-4-31B-it-NVFP4-turbo
Auto-discovered from HuggingFace trending. 246 likes, 74K downloads.
- ▾Supergemma4 26b Uncensored Mlx 4bit V2 · New · Open
Jiunsong · self-host
Best for: Running an uncensored Gemma 4 variant on Apple Silicon via MLX (trending on HuggingFace: 170 likes this week)
How: Available on Hugging Face. 12K downloads.
Example: from mlx_lm import load; model, tokenizer = load("Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2") (MLX checkpoints load via mlx-lm, not transformers)
mlx · safetensors · gemma4 · uncensored · apple-silicon
API: huggingface.co/Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2
Auto-discovered from HuggingFace trending. 170 likes, 12K downloads.
- ▾Gemma 4 E4B It OBLITERATED · New · Open
OBLITERATUS · self-host
Best for: Uncensored (abliterated) Gemma 4 E4B variant (trending on HuggingFace: 276 likes this week)
How: Available on Hugging Face.
Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("OBLITERATUS/gemma-4-E4B-it-OBLITERATED")
safetensors · gguf · gemma4 · abliterated · uncensored
API: huggingface.co/OBLITERATUS/gemma-4-E4B-it-OBLITERATED
Auto-discovered from HuggingFace trending. 276 likes, 7K downloads.
- ▾Supergemma4 26b Uncensored Gguf V2 · New · Open
Jiunsong · self-host
Best for: Uncensored Gemma 4 in GGUF format for llama.cpp (trending on HuggingFace: 381 likes this week)
How: Available on Hugging Face. 54K downloads.
Example: from llama_cpp import Llama; llm = Llama.from_pretrained(repo_id="Jiunsong/supergemma4-26b-uncensored-gguf-v2", filename="*.gguf") (GGUF checkpoints load via llama-cpp-python, not plain AutoModelForCausalLM)
gguf · gemma4 · uncensored · fast · llama.cpp
API: huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2
Auto-discovered from HuggingFace trending. 381 likes, 54K downloads.
- ▾GLM 5.1 · New · Open
zai-org · self-host
Best for: Trending on HuggingFace (1383 likes this week)
How: Available on Hugging Face. 100K downloads.
Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-5.1")
transformers · safetensors · glm_moe_dsa · text-generation · conversational
API: huggingface.co/zai-org/GLM-5.1
Auto-discovered from HuggingFace trending. 1383 likes, 100K downloads.
- ▾MiniMax M2.7 · New · Open
MiniMaxAI · self-host
Best for: Trending on HuggingFace (925 likes this week)
How: Available on Hugging Face. 189K downloads.
Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("MiniMaxAI/MiniMax-M2.7")
transformers · safetensors · minimax_m2 · text-generation · conversational
API: huggingface.co/MiniMaxAI/MiniMax-M2.7
Auto-discovered from HuggingFace trending. 925 likes, 189K downloads.
- ▾DeepSeek V3.2 · Open
DeepSeek · 164K tokens · self-host
Best for: Long-context coding, upgraded V3 deployments
How: Drop-in upgrade from V3. Uses Dynamic Sparse Attention for better long-context performance.
Example: Feed your entire microservice codebase and get cross-service dependency analysis.
HumanEval 94.0% · coding · math · sparse attention (DSA) · MIT license · improved context
Hardware to self-host: VRAM 350GB (quantized) · GPU: 8× H100 80GB · RAM: 512GB+ system RAM
Same hardware footprint as V3 — 671B with sparse attention.
API: api.deepseek.com OR self-host via vLLM. Same OpenAI-compatible API.
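Because the endpoint is OpenAI-compatible, migrating from OpenAI is a base_url change. A sketch assuming the `openai` SDK and a `DEEPSEEK_API_KEY` env var; the model alias `deepseek-chat` is an assumption here, so confirm the exact V3.2 ID in DeepSeek's docs.

```python
import os

# Only these two values differ from a stock OpenAI client setup.
BASE_URL = "https://api.deepseek.com"
MODEL = "deepseek-chat"  # assumed alias; verify the V3.2 model ID

if os.environ.get("DEEPSEEK_API_KEY"):
    from openai import OpenAI  # pip install openai

    client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url=BASE_URL)
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": "Map the dependencies across these services: ..."}],
    )
    print(resp.choices[0].message.content)
```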
- ▾Mistral Large 3 · Open
Mistral · 256K tokens · self-host
Best for: European deployments, agent workflows, long-context multilingual apps
How: Major upgrade from Large 2. MoE architecture with 41B active params. Same API, just change model ID.
Example: Build a multi-tool agent that queries DBs, calls APIs, and generates reports in 30+ languages.
MoE 41B active / 675B total · multilingual · function calling · 256K context
Hardware to self-host: VRAM 350GB (quantized) · GPU: 8× H100 80GB · RAM: 512GB+ system RAM
675B MoE (41B active). Datacenter class — most users go via api.mistral.ai.
API: api.mistral.ai OR self-host via vLLM. OpenAI-compatible.
- ▾Kimi K2.5
Moonshot AI · 256K tokens · $0.55/M → $2.19/M
Best for: Budget alternative to flagship models, Chinese language tasks
How: OpenAI SDK with base_url='https://api.moonshot.ai/v1'. WARNING: has implicit reasoning that eats max_tokens.
Example: Use moonshot-v1-8k instead for structured JSON tasks — kimi-k2.5 wastes tokens on hidden thinking.
reasoning · multimodal · cheap
API: api.moonshot.ai — OpenAI-compatible
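Since the hidden reasoning is billed against `max_tokens`, one mitigation is to budget explicit headroom on top of the answer you want back. A sketch with illustrative numbers, assuming the `openai` SDK and a `MOONSHOT_API_KEY` env var.

```python
import os

# Reserve room for both the invisible thinking and the visible answer.
VISIBLE_BUDGET = 1024      # tokens you actually want back
THINKING_HEADROOM = 2048   # illustrative; tune per task

params = {
    "model": "kimi-k2.5",
    "max_tokens": VISIBLE_BUDGET + THINKING_HEADROOM,
    # no "temperature" key: it is locked to 1 per this entry
    "messages": [{"role": "user", "content": "Classify this support ticket: ..."}],
}

if os.environ.get("MOONSHOT_API_KEY"):
    from openai import OpenAI  # pip install openai

    client = OpenAI(api_key=os.environ["MOONSHOT_API_KEY"],
                    base_url="https://api.moonshot.ai/v1")
    print(client.chat.completions.create(**params).choices[0].message.content)
```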
Watch: hidden thinking burns tokens · temperature locked to 1
- ▾Claude Opus 4.6
Anthropic · 1M tokens · $15/M → $75/M
Best for: Complex multi-step coding, large codebase refactors, long-document analysis
How: Best via Claude Code CLI for coding tasks. For API: messages.create() with system prompt + tools.
Example: claude-code: point it at a repo, describe the feature, it reads/edits/tests autonomously.
SWE-bench 72.5% · GPQA Diamond 74.9% · HumanEval 95.4% · reasoning · long context · tool use · agentic workflows · code generation
API: api.anthropic.com — SDK: pip install anthropic / npm i @anthropic-ai/sdk
- ▾Claude Sonnet 4.6
Anthropic · 200K tokens · $3/M → $15/M
Best for: Production API backends, real-time chat, moderate complexity coding
How: Drop-in replacement for Opus when you need faster/cheaper. Same API, just change model ID.
Example: Use as the default model in your API gateway — upgrade to Opus only for hard problems.
SWE-bench 65.2% · HumanEval 93.8% · speed · cost-efficiency · coding · tool use
API: api.anthropic.com — same SDK as Opus
- ▾GPT-4.1
OpenAI · 1M tokens · $2/M → $8/M
Best for: General-purpose API integration, multimodal apps, coding assistance
How: client.chat.completions.create(model='gpt-4.1', messages=[...]). Supports vision, tools, JSON mode.
Example: Build a PR review bot that reads diffs + screenshots and posts comments.
SWE-bench 54.6% · HumanEval 95.3% · coding · instruction following · long context · multimodal
API: api.openai.com — SDK: pip install openai / npm i openai
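The JSON mode mentioned above is enabled via `response_format`; a sketch of a PR-review payload, assuming the `openai` SDK (the field names in the requested JSON are illustrative).

```python
import os

# JSON mode constrains the model to emit a single valid JSON object.
# The prompt must mention "JSON" for JSON mode to be accepted.
params = {
    "model": "gpt-4.1",
    "response_format": {"type": "json_object"},
    "messages": [
        {"role": "system",
         "content": 'Review the diff. Reply as JSON: {"verdict": "...", "comments": []}'},
        {"role": "user", "content": "diff --git a/app.py b/app.py ..."},
    ],
}

if os.environ.get("OPENAI_API_KEY"):
    import json
    from openai import OpenAI  # pip install openai

    client = OpenAI()
    resp = client.chat.completions.create(**params)
    review = json.loads(resp.choices[0].message.content)  # parseable by contract
```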
- ▾Llama 4 Maverick · Open
Meta · 1M tokens · self-host
Best for: Self-hosted production deployments, privacy-sensitive workloads
How: ollama run llama4-maverick OR deploy on vLLM with tensor parallelism. Also available hosted on Together/Groq.
Example: Deploy on 2× H100 (or 4× A100) GPUs behind your API gateway for private code review.
MMLU 88.4% · HumanEval 84.8% · multilingual · multimodal · MoE architecture · 17B active / 400B total
Hardware to self-host: VRAM 200GB (quantized) · GPU: 2× H100 80GB or 4× A100 80GB · RAM: 256GB system RAM
400B total params (17B active). FP16 needs ~800GB, FP8 ~400GB, INT4 ~200GB.
API: Self-host via vLLM, Ollama, or use via Together, Fireworks, Groq
- ▾Llama 4 Scout · Open
Meta · 10M tokens · self-host
Best for: Processing entire codebases, very long documents, single-GPU deployments
How: Fits on a single H100. Best open model for extreme context lengths.
Example: Feed your entire monorepo into context and ask about cross-service dependencies.
MMLU 86.2% · longest context (10M) · MoE 17B active / 109B total · fits single H100
Hardware to self-host: VRAM 80GB · GPU: 1× H100 80GB · RAM: 128GB system RAM
17B active params, fits in a single H100 at FP8.
API: Same as Maverick — vLLM, Ollama, Together, Fireworks
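A self-hosted vLLM deployment exposes the same OpenAI-compatible surface as the hosted providers; a sketch assuming a local `vllm serve` is running (the Hugging Face repo ID shown is an assumption, so check the actual model page).

```python
import os

# vLLM's server (e.g. `vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct`)
# speaks the OpenAI protocol on port 8000 by default.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # assumed repo ID; verify on HF

if os.environ.get("VLLM_SERVER_UP"):  # set this when the local server is running
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=BASE_URL, api_key="EMPTY")  # vLLM ignores the key
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": "Trace the cross-service calls in this monorepo: ..."}],
    )
    print(resp.choices[0].message.content)
```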
- ▾Qwen 3 235B · Open
Alibaba · 128K tokens · self-host
Best for: Flexible thinking control, commercial self-hosting, multilingual
How: Supports /think and /no_think tags to toggle reasoning on/off per request. Apache 2.0 = fully commercial.
Example: Use /no_think for fast classification, /think for complex debugging — same model.
AIME 2024 85.7% · HumanEval 90.2% · hybrid thinking · MoE 22B active · Apache 2.0 · multilingual
Hardware to self-host: VRAM 140GB (quantized) · GPU: 4× A100 80GB or 2× H100 · RAM: 256GB+ system RAM
235B total (22B active). MoE architecture — only 22B params active per forward pass.
API: Self-host via vLLM/SGLang or use via Together, Fireworks. Also on Alibaba Cloud.
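The `/think` and `/no_think` switches described above are plain-text tags appended to the user turn; a minimal helper sketching that convention, independent of which OpenAI-compatible host serves the model.

```python
def user_turn(prompt: str, think: bool) -> dict:
    """Build a user message with Qwen 3's soft-switch tag appended."""
    tag = "/think" if think else "/no_think"
    return {"role": "user", "content": f"{prompt} {tag}"}

# Fast classification: skip the reasoning trace.
fast = user_turn("Label this log line as INFO/WARN/ERROR: ...", think=False)
# Hard debugging: let the model reason step by step.
slow = user_turn("Why does this code deadlock? ...", think=True)
```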
- ▾Gemini 2.5 Pro
Google · 1M tokens · $1.25/M → $10/M
Best for: Long-document analysis, multimodal tasks, apps needing search grounding
How: client.models.generate_content(model='gemini-2.5-pro', contents=[...]). Supports grounding with Google Search.
Example: Feed a 200-page architecture doc and ask it to find security issues.
SWE-bench 63.8% · GPQA Diamond 67.2% · multimodal · long context · search grounding · code generation
API: generativelanguage.googleapis.com — SDK: pip install google-genai
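The `generate_content` call above, sketched with the `google-genai` SDK and a `GOOGLE_API_KEY` env var; the search-grounding tool config follows the SDK's `types` module, but treat the exact names as an assumption to verify against Google's docs.

```python
import os

MODEL = "gemini-2.5-pro"

if os.environ.get("GOOGLE_API_KEY"):
    from google import genai            # pip install google-genai
    from google.genai import types

    client = genai.Client()             # reads GOOGLE_API_KEY
    resp = client.models.generate_content(
        model=MODEL,
        contents="Find security issues in this architecture doc: ...",
        # Grounding with Google Search, as mentioned above:
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())]
        ),
    )
    print(resp.text)
```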
- ▾Grok 3
xAI · 128K tokens · $3/M → $15/M
Best for: Tasks needing real-time information, math-heavy problems
How: OpenAI SDK with base_url override. Also supports live search via tools.
Example: Monitor real-time tech news and generate summaries using live search.
GPQA Diamond 68.2% · AIME 2024 93.3% · reasoning · real-time data · math
API: api.x.ai — OpenAI-compatible SDK. Set base_url='https://api.x.ai/v1'
- ▾Llama 3.3 70B · Open
Meta · 128K tokens · self-host
Best for: Proven workhorse for self-hosted deployments, fine-tuning base
How: ollama run llama3.3:70b. For production: vLLM on 2x A100 or 4x A10G.
Example: Fine-tune on your internal docs for a private knowledge base chatbot.
MMLU 86.0% · HumanEval 88.4% · mature ecosystem · fine-tuning friendly · wide hardware support
Hardware to self-host: VRAM 40GB (4-bit) / 140GB (FP16) · GPU: 2× A100 80GB or 4× A10G 24GB · RAM: 64GB+ system RAM
70B dense. Widely supported — runs on Ollama with quantization on 48GB VRAM.
API: Ollama, vLLM, TGI, or hosted (Together $0.60/M, Groq, Fireworks)
- ▾DeepSeek V3 · Open
DeepSeek · 128K tokens · self-host
Best for: Cost-sensitive production APIs, coding tasks, math-heavy pipelines
How: Cheapest top-tier API. OpenAI-compatible. Self-host needs 8x A100.
Example: Replace GPT-4 in your CI pipeline for automated code review at 1/10th the cost.
HumanEval 92.1% · MMLU 88.5% · coding · math · MoE 37B active / 671B total · MIT license
Hardware to self-host: VRAM 350GB (quantized) / 1.3TB (FP16) · GPU: 8× H100 80GB or 8× A100 80GB · RAM: 512GB+ system RAM
671B total (37B active). Most users rent via API — self-hosting needs datacenter hardware.
API: api.deepseek.com ($0.27/M in, $1.10/M out) OR self-host