AI Models

52 models · 18 new in 60d

Compare →
  • Gemma 4 31B DenseNewOpen

    Google · 256K tokens · self-host

    Best for: Self-hosted multimodal production, commercial use, multilingual apps

    How: Dense 31B — fits on a single A100 or 2x RTX 4090. Apache 2.0 = fully commercial. Supports images and video natively.

    Example: Deploy as a private multimodal assistant that reads screenshots, logs, and video clips.

    LMSYS Arena #3 textMMLU ~82%
    multimodalimages + video35+ languagesApache 2.0dense architecture
    Hardware to self-host
    VRAM: 20GB (quantized) / 62GB (FP16)
    GPU: 1× A100 80GB or 2× RTX 4090 24GB
    RAM: 32GB+ system RAM

    31B dense. Native multimodal (images + video) increases compute cost vs text-only.

    API: Ollama, vLLM, Hugging Face, Vertex AI. ollama run gemma4:31b

    Brand new (Apr 2026). Ranked #3 on LMSYS Arena text leaderboard at launch.

  • Gemma 4 27B MoENewOpen

    Google · 128K tokens · self-host

    Best for: Faster self-hosted inference, cost-efficient multimodal

    How: MoE variant — faster inference than the 31B dense. Same multimodal capabilities.

    Example: Process image-based monitoring alerts faster than the dense variant at the same quality.

    LMSYS Arena #6 text
    MoE efficiencymultimodalimages + videoApache 2.0
    Hardware to self-host
    VRAM: 18GB (quantized) / 54GB (FP16)
    GPU: RTX 4090 24GB or 1× A100 40GB
    RAM: 32GB+ system RAM

    27B total MoE — faster inference than the 31B dense thanks to sparse activations.

    API: Ollama, vLLM, Hugging Face. ollama run gemma4:27b-moe

  • Gemma 4 E4BNewOpen

    Google · 128K tokens · self-host

    Best for: Edge, mobile, IoT, on-device AI with multimodal input

    How: 4B params — runs on any hardware. Supports images, video, AND native audio input.

    Example: Run on a Raspberry Pi to process security camera feeds with voice commands.

    tinyon-devicemultimodal + audioApache 2.0
    Hardware to self-host
    VRAM: 3GB (quantized) / 8GB (FP16)
    GPU: Any — CPU, phone, Jetson, Raspberry Pi 5, integrated GPU
    RAM: 4-8GB system RAM

    4B params. Edge-first design: runs on phones, SBCs, IoT devices.

    API: Ollama, Hugging Face. Runs on phones and Raspberry Pi.

  • ERNIE Image Turbo GGUFNewOpen

    unsloth · self-host

    Best for: Trending on HuggingFace (151 likes this week)

    How: Available on Hugging Face. 14K downloads.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("unsloth/ERNIE-Image-Turbo-GGUF")

    ggmlgguftext-to-imageunslothbase_model:baidu/ERNIE-Image-Turbo

    API: huggingface.co/unsloth/ERNIE-Image-Turbo-GGUF

    Auto-discovered from HuggingFace trending. 151 likes, 14K downloads.

  • Gemma 4 31B It NVFP4 TurboNewOpen

    LilaRest · self-host

    Best for: Trending on HuggingFace (246 likes this week)

    How: Available on Hugging Face. 74K downloads.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("LilaRest/gemma-4-31B-it-NVFP4-turbo")

    transformerssafetensorsgemma4text-generationgemma-4-31b-it

    API: huggingface.co/LilaRest/gemma-4-31B-it-NVFP4-turbo

    Auto-discovered from HuggingFace trending. 246 likes, 74K downloads.

  • Supergemma4 26b Uncensored Mlx 4bit V2NewOpen

    Jiunsong · self-host

    Best for: Trending on HuggingFace (170 likes this week)

    How: Available on Hugging Face. 12K downloads.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2")

    mlxsafetensorsgemma4uncensoredapple-silicon

    API: huggingface.co/Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2

    Auto-discovered from HuggingFace trending. 170 likes, 12K downloads.

  • Gemma 4 E4B It OBLITERATEDNewOpen

    OBLITERATUS · self-host

    Best for: Trending on HuggingFace (276 likes this week)

    How: Available on Hugging Face.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("OBLITERATUS/gemma-4-E4B-it-OBLITERATED")

    safetensorsggufgemma4abliterateduncensored

    API: huggingface.co/OBLITERATUS/gemma-4-E4B-it-OBLITERATED

    Auto-discovered from HuggingFace trending. 276 likes, 7K downloads.

  • Gemma 4 31B JANG_4M CRACKNewOpen

    dealignai · self-host

    Best for: Trending on HuggingFace (1258 likes this week)

    How: Available on Hugging Face. 153K downloads.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("dealignai/Gemma-4-31B-JANG_4M-CRACK")

    mlxsafetensorsgemma4abliterateduncensored

    API: huggingface.co/dealignai/Gemma-4-31B-JANG_4M-CRACK

    Auto-discovered from HuggingFace trending. 1258 likes, 153K downloads.

  • ERNIE Image TurboNewOpen

    baidu · self-host

    Best for: Trending on HuggingFace (290 likes this week)

    How: Available on Hugging Face.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("baidu/ERNIE-Image-Turbo")

    diffuserssafetensorstext-to-image8Blicense:apache-2.0

    API: huggingface.co/baidu/ERNIE-Image-Turbo

    Auto-discovered from HuggingFace trending. 290 likes, 3K downloads.

  • Qwen3.6 35B A3B GGUFNewOpen

    unsloth · self-host

    Best for: Trending on HuggingFace (367 likes this week)

    How: Available on Hugging Face. 153K downloads.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3.6-35B-A3B-GGUF")

    transformersggufunslothqwenqwen3_5_moe

    API: huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

    Auto-discovered from HuggingFace trending. 367 likes, 153K downloads.

  • Supergemma4 26b Uncensored Gguf V2NewOpen

    Jiunsong · self-host

    Best for: Trending on HuggingFace (381 likes this week)

    How: Available on Hugging Face. 54K downloads.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("Jiunsong/supergemma4-26b-uncensored-gguf-v2")

    ggufgemma4uncensoredfastllama.cpp

    API: huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2

    Auto-discovered from HuggingFace trending. 381 likes, 54K downloads.

  • GLM 5.1NewOpen

    zai-org · self-host

    Best for: Trending on HuggingFace (1383 likes this week)

    How: Available on Hugging Face. 100K downloads.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-5.1")

    transformerssafetensorsglm_moe_dsatext-generationconversational

    API: huggingface.co/zai-org/GLM-5.1

    Auto-discovered from HuggingFace trending. 1383 likes, 100K downloads.

  • ERNIE ImageNewOpen

    baidu · self-host

    Best for: Trending on HuggingFace (425 likes this week)

    How: Available on Hugging Face.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("baidu/ERNIE-Image")

    diffuserssafetensorstext-to-image8Blicense:apache-2.0

    API: huggingface.co/baidu/ERNIE-Image

    Auto-discovered from HuggingFace trending. 425 likes, 2K downloads.

  • HY Embodied 0.5NewOpen

    tencent · self-host

    Best for: Trending on HuggingFace (852 likes this week)

    How: Available on Hugging Face.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("tencent/HY-Embodied-0.5")

    transformerssafetensorshunyuan_vl_motimage-text-to-texthunyuan

    API: huggingface.co/tencent/HY-Embodied-0.5

    Auto-discovered from HuggingFace trending. 852 likes, 1K downloads.

  • Qwen3.6 35B A3BNewOpen

    Qwen · self-host

    Best for: Trending on HuggingFace (736 likes this week)

    How: Available on Hugging Face. 21K downloads.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.6-35B-A3B")

    transformerssafetensorsqwen3_5_moeimage-text-to-textconversational

    API: huggingface.co/Qwen/Qwen3.6-35B-A3B

    Auto-discovered from HuggingFace trending. 736 likes, 21K downloads.

  • MiniMax M2.7NewOpen

    MiniMaxAI · self-host

    Best for: Trending on HuggingFace (925 likes this week)

    How: Available on Hugging Face. 189K downloads.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("MiniMaxAI/MiniMax-M2.7")

    transformerssafetensorsminimax_m2text-generationconversational

    API: huggingface.co/MiniMaxAI/MiniMax-M2.7

    Auto-discovered from HuggingFace trending. 925 likes, 189K downloads.

  • Nucleus ImageNewOpen

    NucleusAI · self-host

    Best for: Trending on HuggingFace (159 likes this week)

    How: Available on Hugging Face.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("NucleusAI/Nucleus-Image")

    diffuserssafetensorsmoesparse-moediffusion

    API: huggingface.co/NucleusAI/Nucleus-Image

    Auto-discovered from HuggingFace trending. 159 likes, 802 downloads.

  • Gemma 4 31B ItNewOpen

    google · self-host

    Best for: Trending on HuggingFace (2122 likes this week)

    How: Available on Hugging Face. 3513K downloads.

    Example: from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it")

    transformerssafetensorsgemma4image-text-to-textconversational

    API: huggingface.co/google/gemma-4-31B-it

    Auto-discovered from HuggingFace trending. 2122 likes, 3.5M downloads.

  • DeepSeek V3.2Open

    DeepSeek · 164K tokens · self-host

    Best for: Long-context coding, upgraded V3 deployments

    How: Drop-in upgrade from V3. Uses Dynamic Sparse Attention for better long-context performance.

    Example: Feed your entire microservice codebase and get cross-service dependency analysis.

    HumanEval 94.0%
    codingmathsparse attention (DSA)MIT licenseimproved context
    Hardware to self-host
    VRAM: 350GB (quantized)
    GPU: 8× H100 80GB
    RAM: 512GB+ system RAM

    Same hardware footprint as V3 — 671B with sparse attention.

    API: api.deepseek.com OR self-host via vLLM. Same OpenAI-compatible API.

  • Mistral Large 3Open

    Mistral · 256K tokens · self-host

    Best for: European deployments, agent workflows, long-context multilingual apps

    How: Major upgrade from Large 2. MoE architecture with 41B active params. Same API, just change model ID.

    Example: Build a multi-tool agent that queries DBs, calls APIs, and generates reports in 30+ languages.

    MoE 41B active / 675B totalmultilingualfunction calling256K context
    Hardware to self-host
    VRAM: 350GB (quantized)
    GPU: 8× H100 80GB
    RAM: 512GB+ system RAM

    675B MoE (41B active). Datacenter class — most users go via api.mistral.ai.

    API: api.mistral.ai OR self-host via vLLM. OpenAI-compatible.

  • Ministral 3 (3B/8B/14B)Open

    Mistral · 128K tokens · self-host

    Best for: Edge deployment, on-device AI, lightweight vision tasks

    How: 3B fits on phones, 8B on laptops, 14B on dev GPUs. All have vision support.

    Example: Run 8B on a Jetson to classify manufacturing defects from camera feeds.

    edge-friendlyvisiondense3 sizes
    Hardware to self-host
    VRAM: 2GB (3B) / 6GB (8B) / 10GB (14B quantized)
    GPU: Phone/CPU (3B) · Laptop GPU (8B) · RTX 3060+ (14B)
    RAM: 8-16GB system RAM

    All three sizes are dense with vision. 3B runs on phones, 8B on laptops, 14B on dev GPUs.

    API: Ollama, vLLM, Hugging Face. Also on Mistral API.

  • Llama 4 MaverickOpen

    Meta · 1M tokens · self-host

    Best for: Self-hosted production deployments, privacy-sensitive workloads

    How: ollama run llama4-maverick OR deploy on vLLM with tensor parallelism. Also available hosted on Together/Groq.

    Example: Deploy on 2x A100 GPUs behind your API gateway for private code review.

    MMLU 88.4%HumanEval 84.8%
    multilingualmultimodalMoE architecture17B active / 400B total
    Hardware to self-host
    VRAM: 200GB (quantized)
    GPU: 2× H100 80GB or 4× A100 80GB
    RAM: 256GB system RAM

    400B total params (17B active). FP16 needs ~800GB, FP8 ~400GB, INT4 ~200GB.

    API: Self-host via vLLM, Ollama, or use via Together, Fireworks, Groq

  • Llama 4 ScoutOpen

    Meta · 10M tokens · self-host

    Best for: Processing entire codebases, very long documents, single-GPU deployments

    How: Fits on a single H100. Best open model for extreme context lengths.

    Example: Feed your entire monorepo into context and ask about cross-service dependencies.

    MMLU 86.2%
    longest context (10M)MoE 17B active / 109B totalfits single H100
    Hardware to self-host
    VRAM: 80GB
    GPU: 1× H100 80GB
    RAM: 128GB system RAM

    17B active params, fits in a single H100 at FP8.

    API: Same as Maverick — vLLM, Ollama, Together, Fireworks

  • Qwen 3 235BOpen

    Alibaba · 128K tokens · self-host

    Best for: Flexible thinking control, commercial self-hosting, multilingual

    How: Supports /think and /no_think tags to toggle reasoning on/off per request. Apache 2.0 = fully commercial.

    Example: Use /no_think for fast classification, /think for complex debugging — same model.

    AIME 2024 85.7%HumanEval 90.2%
    hybrid thinkingMoE 22B activeApache 2.0multilingual
    Hardware to self-host
    VRAM: 140GB (quantized)
    GPU: 4× A100 80GB or 2× H100
    RAM: 256GB+ system RAM

    235B total (22B active). MoE architecture — only 22B params active per forward pass.

    API: Self-host via vLLM/SGLang or use via Together, Fireworks. Also on Alibaba Cloud.

  • Qwen 3 30BOpen

    Alibaba · 128K tokens · self-host

    Best for: Local development, laptop-friendly reasoning, privacy

    How: Excellent for local dev. MoE means only 3B params active — fast on consumer hardware.

    Example: Run on your dev machine as a private coding assistant with reasoning.

    AIME 2024 66.7%
    MoE 3B active / 30B totalruns on consumer GPUhybrid thinking
    Hardware to self-host
    VRAM: 20GB (quantized) / 60GB (FP16)
    GPU: RTX 4090 24GB (quantized) or 1× A100
    RAM: 32GB+ system RAM

    30B total (3B active). The 3B active params make inference fast on consumer hardware.

    API: ollama run qwen3:30b — fits on RTX 4090 (24GB)

  • Gemma 3 27BOpen

    Google · 128K tokens · self-host

    Best for: On-device/edge deployment, multimodal at small scale

    How: ollama run gemma3:27b. Fits on RTX 3090/4090. Good multimodal + tool use at small size.

    Example: Run on a dev server to process screenshots and generate bug reports.

    MMLU 75.6%HumanEval 78.0%
    compactmultimodalruns on single GPUfunction calling
    Hardware to self-host
    VRAM: 18GB (quantized) / 54GB (FP16)
    GPU: RTX 3090/4090 24GB or 1× A100 40GB
    RAM: 32GB+ system RAM

    27B dense. Fits on a single high-end consumer GPU with quantization.

    API: Ollama, vLLM, Hugging Face. Also on Vertex AI.

  • Nomic Embed Text v2-MoEOpen

    Nomic AI · 8K tokens · self-host

    Best for: Self-hosted RAG, privacy-first search, zero-cost embeddings

    How: Self-host for zero cost. Comparable quality to OpenAI embeddings.

    Example: Run alongside pgvector on the same server — full RAG pipeline with zero API costs.

    MoE embeddingmatryoshkaApache 2.0self-hostable
    Hardware to self-host
    VRAM: 2GB or CPU-only
    GPU: Any — runs on CPU at reasonable speed
    RAM: 4-8GB system RAM

    Tiny MoE embedding model. CPU inference is fast enough for most use cases.

    API: pip install nomic OR Ollama. Also hosted on Nomic Atlas.

  • DeepSeek R1Open

    DeepSeek · 128K tokens · self-host

    Best for: Budget reasoning, self-hosted chain-of-thought, research

    How: API is OpenAI-compatible. Self-host the 70B distill on 2x A100. MIT license = no restrictions.

    Example: Run the 14B distill locally for debugging complex distributed system issues.

    AIME 2024 79.8%SWE-bench 49.2%GPQA Diamond 71.5%
    reasoningmathcodingMIT licensedistillable
    Hardware to self-host
    VRAM: 10GB (14B distill) / 48GB (70B distill) / 1TB+ (full 671B)
    GPU: RTX 4090 (14B) · 2× A100 (70B) · 8× H100 (full)
    RAM: Full model needs 256GB+ system RAM

    Full 671B MoE is massive. Distilled versions (14B, 32B, 70B) are far more practical.

    API: api.deepseek.com ($0.55/M in, $2.19/M out) OR self-host via vLLM/Ollama

  • Codestral 25.01Open

    Mistral · 256K tokens · self-host

    Best for: Code completion, inline suggestions, editor integration

    How: Supports FIM for inline completion. Integrate with any editor via LSP or Continue.dev.

    Example: Deploy as your team's FIM-capable completion server behind an LSP proxy.

    HumanEval 91.0%
    code completionFIM (fill-in-middle)80+ languages
    Hardware to self-host
    VRAM: 16GB (quantized) / 45GB (FP16)
    GPU: RTX 4090 24GB or 1× A100 40GB
    RAM: 32GB+ system RAM

    22B dense. Fits on a single consumer GPU with quantization.

    API: codestral.mistral.ai — dedicated code endpoint

  • Llama 3.3 70BOpen

    Meta · 128K tokens · self-host

    Best for: Proven workhorse for self-hosted deployments, fine-tuning base

    How: ollama run llama3.3:70b. For production: vLLM on 2x A100 or 4x A10G.

    Example: Fine-tune on your internal docs for a private knowledge base chatbot.

    MMLU 86.0%HumanEval 88.4%
    mature ecosystemfine-tuning friendlywide hardware support
    Hardware to self-host
    VRAM: 40GB (4-bit) / 140GB (FP16)
    GPU: 2× A100 80GB or 4× A10G 24GB
    RAM: 64GB+ system RAM

    70B dense. Widely supported — runs on Ollama with quantization on 48GB VRAM.

    API: Ollama, vLLM, TGI, or hosted (Together $0.60/M, Groq, Fireworks)

  • DeepSeek V3Open

    DeepSeek · 128K tokens · self-host

    Best for: Cost-sensitive production APIs, coding tasks, math-heavy pipelines

    How: Cheapest top-tier API. OpenAI-compatible. Self-host needs 8x A100.

    Example: Replace GPT-4 in your CI pipeline for automated code review at 1/10th the cost.

    HumanEval 92.1%MMLU 88.5%
    codingmathMoE 37B active / 671B totalMIT license
    Hardware to self-host
    VRAM: 350GB (quantized) / 1.3TB (FP16)
    GPU: 8× H100 80GB or 8× A100 80GB
    RAM: 512GB+ system RAM

    671B total (37B active). Most users rent via API — self-hosting needs datacenter hardware.

    API: api.deepseek.com ($0.27/M in, $1.10/M out) OR self-host

  • Phi-4Open

    Microsoft · 16K tokens · self-host

    Best for: Edge deployment, STEM tasks, embedded AI in products

    How: ollama run phi4. MIT license — embed in commercial products freely.

    Example: Embed in a CI pipeline to validate config files and Terraform plans.

    GPQA Diamond 56.2%MATH 80.4%
    14B paramsSTEM reasoningMIT licenseruns on laptop
    Hardware to self-host
    VRAM: 9GB (quantized) / 28GB (FP16)
    GPU: Any 8GB+ GPU (RTX 3060, laptop 4050, etc.)
    RAM: 16GB system RAM

    14B dense. Runs locally on most developer laptops with quantization.

    API: Ollama, Hugging Face, Azure AI

  • Qwen 2.5 Coder 32BOpen

    Alibaba · 128K tokens · self-host

    Best for: Private code completion, self-hosted Copilot replacement

    How: ollama run qwen2.5-coder:32b. Plug into Continue.dev or Copilot alternatives.

    Example: Set up as your team's private code completion backend — zero data leaves your infra.

    HumanEval 92.7%LiveCodeBench 48.5%
    code completioncode generationApache 2.0
    Hardware to self-host
    VRAM: 20GB (quantized) / 64GB (FP16)
    GPU: RTX 4090 24GB or 1× A100 40GB
    RAM: 32GB+ system RAM

    32B dense. Fits on a single consumer GPU with 4-bit quantization.

    API: Ollama, vLLM, or hosted on Together/Fireworks