AI Tips

Practical ways to train, run, or shrink AI models, explained for people new to AI.

New here? Each card answers one question: what is this and why should I care? Click a card to read the full explanation, including any new words. The command at the bottom is what you would type to try it on your own machine.

The AI flow — where each tip fits

Read left to right

An AI model goes through these five phases. Click a phase to see the tips that apply there.

  1. Pre-training

    A model first learns language by reading huge amounts of text. For the largest models this can cost millions of dollars and run on thousands of GPUs.

  2. Fine-tuning

    You take that pre-trained model and teach it your own data, your own task, or your own writing style. This typically takes hours to days on a few GPUs.

    e.g. QLoRA · Unsloth · DeepSpeed ZeRO-3 / FSDP

    See Training tips →
  3. Preference tuning

    After fine-tuning, you teach the model which answers humans prefer. This makes it polite, helpful, and on-topic.

    e.g. DPO / GRPO / KTO

    See Training tips →
  4. Quantization

    The trained model is huge. Quantization shrinks it about 4× (for example, by storing 16-bit weights as 4-bit) at a small cost in precision, so it fits on cheaper hardware.

    e.g. GGUF + llama.cpp · AWQ / GPTQ · EXL2

    See Quantization tips →
  5. Inference / serving

    Running the model so users can ask it questions. This is what your app actually does in production.

    e.g. vLLM (PagedAttention) · Ollama · Speculative decoding

    See Inference tips →
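
To see where the "about 4×" in phase 4 comes from, here is a small back-of-the-envelope sketch. It assumes a hypothetical 70-billion-parameter model and counts the weights only. Real quantized files also store scaling factors, and running a model needs extra memory for activations and the KV cache, so treat the numbers as rough.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Hypothetical model size, chosen only for illustration.
params = 70e9

fp16_gb = weight_memory_gb(params, 16)  # weights stored as 16-bit numbers
int4_gb = weight_memory_gb(params, 4)   # weights stored as 4-bit numbers
ratio = fp16_gb / int4_gb               # how much smaller the 4-bit copy is

print(f"16-bit: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB, ratio: {ratio:.0f}x")
```

Going from 16 bits to 4 bits per weight is where the factor of 4 comes from; 8-bit quantization would give about 2× instead.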
Mixture of Experts (MoE): huge model, small-model latency

The model is a team of smaller experts. For each word it uses only a few of them (two in Mixtral), so a 141B model answers about as fast as a 39B one.

Technique · depends on top-k

A normal ('dense') model uses all of its parameters for every word (more precisely, every token). An MoE replaces the feed-forward block in its layers with several smaller experts, plus a router that picks the top-k experts per token (k = 2 in Mixtral). So Mixtral-8x22B has 141B total parameters but only ~39B are active per token, and answers come at roughly the speed of a 39B model. The trade-off: you still have to load all experts in GPU memory, so VRAM use is high. DeepSeek-V3 and Mixtral are well-known open MoEs.
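
The routing step can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular library implements it: the experts are hypothetical stand-ins (each one just scales its input), the router scores are made up, and mixing the chosen experts with a softmax over their scores is one common choice rather than the only one.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    # Indices of the k largest scores, highest first.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts on x and mix their outputs."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # the other experts are never called
        out = [o + weight * v for o, v in zip(out, y)]
    return out

def make_expert(scale):
    """Hypothetical expert: stands in for a feed-forward block."""
    return lambda x: [scale * v for v in x]

experts = [make_expert(i + 1) for i in range(8)]             # 8 toy experts
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]   # made-up scores for one token

routes = top_k_route(router_logits, k=2)
output = moe_layer([1.0, 2.0], experts, router_logits, k=2)
print(routes, output)
```

Only two of the eight experts do any work for this token, which is why the compute per token tracks the active parameter count. All eight still have to exist in memory, because the next token may be routed to different experts.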

Try it

vllm serve mistralai/Mixtral-8x22B-Instruct-v0.1 --tensor-parallel-size 4
Source
Speculative decoding: make the big model faster without changing its answers

A tiny model guesses the next words. The big model just checks the guesses in one batch. Often 2–3× faster, same answer quality.

Technique · any inference target

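Big models are slow because they generate one token (roughly one word) at a time. With speculative decoding you also load a small, cheap 'draft' model. The draft model writes the next 5 tokens. The big model then runs once and checks all 5 in parallel. Draft tokens are kept up to the first one the big model disagrees with; that token is replaced by the big model's own choice, and the rest of the draft is discarded. With greedy decoding the final answer is identical to what the big model would have written alone (with sampling, it follows the same probability distribution); you just spent fewer expensive runs to get there. The cost is extra memory for the draft model, and the speedup depends on how often the draft guesses right. Built into vLLM, TGI, and llama.cpp.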
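
The accept-or-replace loop can be sketched with two toy "models". This is a minimal illustration under stated assumptions, not any engine's actual implementation: `target_next` and `draft_next` are hypothetical functions that return the next token for a sequence, decoding is greedy, and the big model's single parallel check is simulated with ordinary function calls but counted as one pass.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: the big model alone, one expensive pass per token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=5):
    """Greedy speculative decoding. Returns (tokens, number_of_big_model_passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens, one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The big model checks every draft position. A real engine does this
        #    in one batched forward pass, so we count it as a single pass.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # big model's token replaces the first miss
                break                      # everything after the miss is discarded
        else:
            # Every draft token matched; the same pass also yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes

# Toy models over digit tokens (hypothetical): the target counts upward mod 10,
# and the draft agrees except that it guesses wrong after a 3 or a 7.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] % 4 == 3 else (seq[-1] + 1) % 10

baseline = greedy_decode(target_next, [0], 12)
spec_tokens, passes = speculative_decode(target_next, draft_next, [0], 12, k=5)
print(spec_tokens == baseline, passes)
```

In this toy run the speculative output matches the baseline token for token while using far fewer big-model passes than the 12 the baseline needs. If the draft model guessed wrong every time, each pass would still produce one correct token, so the output would stay the same and only the speedup would disappear.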

Try it

vllm serve meta-llama/Llama-3.1-70B --speculative-model meta-llama/Llama-3.2-1B --num-speculative-tokens 5
Source