AI Models

52 models · 2 new in 60d

  • Gemma 4 27B MoE · New · Open

    Google · 128K tokens · self-host

    Best for: Faster self-hosted inference, cost-efficient multimodal

    How: MoE variant: faster inference than the 31B dense model, with the same multimodal capabilities.

    Example: Process image-based monitoring alerts faster than the dense variant at the same quality.

    LMSYS Arena #6 (text)
    MoE efficiency · multimodal · images + video · Apache 2.0
    Hardware to self-host
    VRAM: 18GB (quantized) / 54GB (FP16)
    GPU: RTX 4090 24GB or 1× A100 40GB
    RAM: 32GB+ system RAM

    27B total MoE — faster inference than the 31B dense thanks to sparse activations.

    API: Ollama, vLLM, Hugging Face. ollama run gemma4:27b-moe
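    A minimal sketch of the monitoring-alert example above, assuming a local Ollama server with the model already pulled. It posts a base64-encoded screenshot to Ollama's `/api/generate` endpoint; the `build_payload` helper and the prompt wording are illustrative, not from the model card.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build an Ollama /api/generate request carrying one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one JSON object instead of a token stream
    }

def describe_alert(image_bytes: bytes) -> str:
    """Send a screenshot to the local model and return its text response."""
    payload = build_payload("gemma4:27b-moe",
                            "Summarize this monitoring alert screenshot.",
                            image_bytes)
    req = urllib.request.Request(OLLAMA_URL,
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

    The same payload shape works for any multimodal Ollama model; only the `model` field changes.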

  • Gemma 4 E4B · New · Open

    Google · 128K tokens · self-host

    Best for: Edge, mobile, IoT, on-device AI with multimodal input

    How: 4B params, small enough for almost any hardware. Supports images, video, and native audio input.

    Example: Run on a Raspberry Pi to process security camera feeds with voice commands.

    tiny · on-device · multimodal + audio · Apache 2.0
    Hardware to self-host
    VRAM: 3GB (quantized) / 8GB (FP16)
    GPU: Any — CPU, phone, Jetson, Raspberry Pi 5, integrated GPU
    RAM: 4-8GB system RAM

    4B params. Edge-first design: runs on phones, SBCs, IoT devices.

    API: Ollama, Hugging Face. Runs on phones and Raspberry Pi.

  • Ministral 3 (3B/8B/14B) · Open

    Mistral · 128K tokens · self-host

    Best for: Edge deployment, on-device AI, lightweight vision tasks

    How: 3B fits on phones, 8B on laptops, 14B on dev GPUs. All have vision support.

    Example: Run 8B on a Jetson to classify manufacturing defects from camera feeds.

    edge-friendly · vision · dense · 3 sizes
    Hardware to self-host
    VRAM: 2GB (3B) / 6GB (8B) / 10GB (14B quantized)
    GPU: Phone/CPU (3B) · Laptop GPU (8B) · RTX 3060+ (14B)
    RAM: 8-16GB system RAM

    All three sizes are dense with vision. 3B runs on phones, 8B on laptops, 14B on dev GPUs.

    API: Ollama, vLLM, Hugging Face. Also on Mistral API.

  • Gemini 2.5 Flash

    Google · 1M tokens · $0.15/M → $0.60/M

    Best for: High-volume processing, real-time apps, budget-conscious pipelines

    How: Set thinking_budget to control reasoning cost. 0 = no thinking, 24576 = max.

    Example: Summarize 1000 GitHub issues per hour for a triage dashboard at ~$1.

    speed · cost · long context · thinking budget control

    API: Same SDK as Gemini Pro. model='gemini-2.5-flash-preview-05-20'
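    A sketch of the `thinking_budget` control described above, assuming the google-genai SDK and the model id from this entry. `clamp_budget` is a hypothetical helper (not part of the SDK) that keeps a requested budget inside the 0–24576 range the entry quotes.

```python
# Reasoning budget bounds quoted in the entry: 0 = no thinking, 24576 = max.
MIN_BUDGET, MAX_BUDGET = 0, 24576

def clamp_budget(requested: int) -> int:
    """Hypothetical helper: keep a requested thinking budget in the valid range."""
    return max(MIN_BUDGET, min(MAX_BUDGET, requested))

def summarize_issue(client, issue_text: str, budget: int = 0) -> str:
    """Summarize one GitHub issue; budget=0 disables thinking for cheap batch runs.

    `client` is a genai.Client() from the google-genai SDK (imported lazily so
    this module loads without the package installed).
    """
    from google.genai import types
    resp = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=f"Summarize this GitHub issue in one sentence:\n{issue_text}",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(
                thinking_budget=clamp_budget(budget))),
    )
    return resp.text
```

    For the triage-dashboard example, keeping the budget at 0 is what holds the cost near $1 per thousand summaries; raise it only for issues the dashboard flags as ambiguous.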

  • Claude Haiku 4.5

    Anthropic · 200K tokens · $0.80/M → $4/M

    Best for: Pipelines, batch processing, structured data extraction, routing

    How: Use for high-volume, low-complexity tasks: classification, extraction, summarization.

    Example: Process 10K support tickets per hour to classify priority and extract entities.

    HumanEval 88.5%
    speed · cost · structured output · classification

    API: api.anthropic.com — same SDK
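    A sketch of the ticket-classification example, assuming the Anthropic Python SDK; the model id and the P0–P3 label scheme are illustrative assumptions, and `parse_priority` is a hypothetical helper for defending against off-format replies.

```python
PRIORITIES = ("P0", "P1", "P2", "P3")

def parse_priority(reply: str) -> str:
    """Pull the first recognized priority label out of a model reply;
    fall back to the lowest priority if none is present."""
    for label in PRIORITIES:
        if label in reply:
            return label
    return "P3"

def classify_ticket(client, ticket_text: str) -> str:
    """Classify one support ticket's priority with a small, cheap model.

    `client` is an anthropic.Anthropic() instance; the model id below is an
    assumption — check the provider's model list before relying on it.
    """
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=10,  # a single label needs only a few tokens
        system="Reply with exactly one of: P0, P1, P2, P3.",
        messages=[{"role": "user", "content": ticket_text}],
    )
    return parse_priority(msg.content[0].text)
```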

  • GPT-4.1 mini

    OpenAI · 1M tokens · $0.40/M → $1.60/M

    Best for: Embeddings preprocessing, log parsing, lightweight generation

    How: Same API as GPT-4.1. Best for high-volume, simple tasks where cost matters.

    Example: Parse 50K structured logs per hour and extract error patterns.

    SWE-bench 28.8% · HumanEval 92.5%
    cost · speed · long context

    API: api.openai.com — same SDK

  • GPT-4.1 nano

    OpenAI · 1M tokens · $0.10/M → $0.40/M

    Best for: Intent classification, entity extraction at massive scale

    How: Use for routing, tagging, simple extraction where quality bar is lower.

    Example: Route 1M incoming messages per day to the right service for $4 total.

    ultra-cheap · fast · classification

    API: api.openai.com — same SDK
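    A back-of-envelope check of the "$4 for 1M messages" figure above, using this entry's prices. The per-message token counts (~30 in, ~2.5 out for a one-word route label) are assumptions, not measurements.

```python
def routing_cost(messages: int, in_tokens: float, out_tokens: float,
                 in_price: float = 0.10, out_price: float = 0.40) -> float:
    """Daily routing cost in dollars; prices are $ per million tokens
    (defaults are GPT-4.1 nano's rates from the entry above)."""
    return (messages * in_tokens * in_price
            + messages * out_tokens * out_price) / 1_000_000

# Assuming ~30 input tokens and ~2.5 output tokens per routed message:
cost = routing_cost(1_000_000, 30, 2.5)  # → 4.0 dollars/day
```

    Note how the math is dominated by input tokens; trimming the routing prompt is the main lever on cost at this scale.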

  • Qwen 3 30B · Open

    Alibaba · 128K tokens · self-host

    Best for: Local development, laptop-friendly reasoning, privacy

    How: Excellent for local dev. MoE means only 3B params active — fast on consumer hardware.

    Example: Run on your dev machine as a private coding assistant with reasoning.

    AIME 2024 66.7%
    MoE 3B active / 30B total · runs on consumer GPU · hybrid thinking
    Hardware to self-host
    VRAM: 20GB (quantized) / 60GB (FP16)
    GPU: RTX 4090 24GB (quantized) or 1× A100
    RAM: 32GB+ system RAM

    30B total (3B active). The 3B active params make inference fast on consumer hardware.

    API: ollama run qwen3:30b — fits on RTX 4090 (24GB)

  • Gemma 3 27B · Open

    Google · 128K tokens · self-host

    Best for: On-device/edge deployment, multimodal at small scale

    How: ollama run gemma3:27b. Fits on RTX 3090/4090. Good multimodal + tool use at small size.

    Example: Run on a dev server to process screenshots and generate bug reports.

    MMLU 75.6% · HumanEval 78.0%
    compact · multimodal · runs on single GPU · function calling
    Hardware to self-host
    VRAM: 18GB (quantized) / 54GB (FP16)
    GPU: RTX 3090/4090 24GB or 1× A100 40GB
    RAM: 32GB+ system RAM

    27B dense. Fits on a single high-end consumer GPU with quantization.

    API: Ollama, vLLM, Hugging Face. Also on Vertex AI.

  • Phi-4 · Open

    Microsoft · 16K tokens · self-host

    Best for: Edge deployment, STEM tasks, embedded AI in products

    How: ollama run phi4. MIT license — embed in commercial products freely.

    Example: Embed in a CI pipeline to validate config files and Terraform plans.

    GPQA Diamond 56.2% · MATH 80.4%
    14B params · STEM reasoning · MIT license · runs on laptop
    Hardware to self-host
    VRAM: 9GB (quantized) / 28GB (FP16)
    GPU: Any 8GB+ GPU (RTX 3060, laptop 4050, etc.)
    RAM: 16GB system RAM

    14B dense. Runs locally on most developer laptops with quantization.

    API: Ollama, Hugging Face, Azure AI

  • Moonshot v1 (8K/32K/128K)

    Moonshot AI · 8K / 32K / 128K tokens · $0.14/M → $0.28/M

    Best for: Batch processing, structured extraction, JSON pipelines

    How: Best for structured output tasks. Supports response_format: json_object. No reasoning overhead.

    Example: Process RSS feeds into structured summaries for pennies per 1000 articles.

    very cheap · no hidden reasoning · reliable JSON

    API: api.moonshot.ai — OpenAI-compatible. model='moonshot-v1-8k'
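    A sketch of the JSON-pipeline example, assuming the OpenAI Python SDK pointed at Moonshot's OpenAI-compatible endpoint. The key schema ("title", "summary", "tags") is illustrative, and `parse_summary` is a hypothetical validator for the model's reply.

```python
import json

REQUIRED_KEYS = {"title", "summary", "tags"}

def parse_summary(raw: str) -> dict:
    """Parse the model's JSON reply and verify the expected keys are present."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

def extract_summary(client, article: str) -> dict:
    """Structured extraction of one article.

    `client` is openai.OpenAI(base_url="https://api.moonshot.ai/v1", api_key=...),
    reusing the standard SDK against the OpenAI-compatible endpoint.
    """
    resp = client.chat.completions.create(
        model="moonshot-v1-8k",
        response_format={"type": "json_object"},  # the JSON mode noted above
        messages=[
            {"role": "system",
             "content": 'Return a JSON object with keys "title", "summary", "tags".'},
            {"role": "user", "content": article},
        ],
    )
    return parse_summary(resp.choices[0].message.content)
```

    Validating the reply before it enters the pipeline is what makes "reliable JSON" operational: a malformed response fails loudly at the boundary instead of corrupting downstream batches.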