AI Tips

Practical ways to train, run, or shrink AI models — explained for people new to AI. 13 new in last 30d.

New here? Each card answers one question: what is this and why should I care? Click a card to read the full explanation, including any new words. The command at the bottom is what you would type to try it on your own machine.

The AI flow — where each tip fits


An AI model goes through these five phases. Click a phase to see the tips that apply there.

  1. Pre-training

    A model first learns language by reading huge amounts of text. This costs millions of dollars and runs on thousands of GPUs.

  2. Fine-tuning

    You take that pre-trained model and teach it your own data, your own task, or your own writing style. Hours to days, on a few GPUs.

    e.g. QLoRA · Unsloth · DeepSpeed ZeRO-3 / FSDP

    See Training tips →
  3. Preference tuning

    After fine-tuning, you teach the model which answers humans prefer. This makes it polite, helpful, and on-topic.

    e.g. DPO / GRPO / KTO

    See Training tips →
  4. Quantization

    The trained model is huge. Quantization shrinks it about 4× by storing its numbers with less precision, so it fits on cheap hardware.

    e.g. GGUF + llama.cpp · AWQ / GPTQ · EXL2

    See Quantization tips →
  5. Inference / serving

    Running the model so users can ask it questions. This is what your app actually does in production.

    e.g. vLLM (PagedAttention) · Ollama · Speculative decoding

    See Inference tips →
DPO / GRPO / KTO — teach a model what good looks like

Modern ways to use human feedback to make a model prefer good answers over bad ones. Simpler than the old RLHF setup.

Training · 1× A100 80GB or QLoRA on 24GB

Classic RLHF (the method behind ChatGPT) trains a separate reward model first, then runs reinforcement learning. It works, but it is complicated. DPO (Direct Preference Optimization) skips the reward model: you give it pairs of (good answer, bad answer) and it adjusts the model directly. GRPO (Group Relative Policy Optimization) is still reinforcement learning, but a lighter form: it samples several answers per prompt and rewards the ones that score above the group average, which suits math and reasoning where correctness can be checked automatically (DeepSeek introduced it for their math model). KTO (Kahneman-Tversky Optimization) needs only single labels (was this answer good? yes/no), so you can use cheap data like in-app thumbs-up/down. All three are in the trl library.
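To make the DPO idea concrete, here is a minimal sketch of the loss for a single (good answer, bad answer) pair, written in plain Python. It assumes you already have the log-probability each answer gets from the model being trained and from a frozen reference copy; the function name, argument names, and numbers are illustrative and are not trl's API.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of answers to the same prompt."""
    # How much more the trained model likes each answer than the frozen reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), computed as softplus(-margin) in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Before training, the model equals the reference, so the margin is zero.
untrained = dpo_loss(-12.0, -12.0, -12.0, -12.0)
# After some training, the good answer became more likely and the bad one less likely.
improved = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(f"untrained: {untrained:.4f}  improved: {improved:.4f}")
```

The loss falls as the model raises the good answer and lowers the bad one relative to the reference. The beta value controls how far the model is allowed to drift from that reference.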

Try it

pip install trl && trl dpo --model_name_or_path meta-llama/Llama-3.1-8B --dataset_name HuggingFaceH4/ultrafeedback_binarized --dataset_train_split train_prefs --output_dir llama-3.1-8b-dpo
Source
Unsloth — 2× faster LoRA fine-tuning, half the VRAM

A drop-in library that makes fine-tuning twice as fast and uses half the GPU memory. Same result, less waiting.

Training · RTX 3090 / 4090 24GB

Hugging Face's PEFT library works well, but it is written in pure PyTorch, which leaves performance on the table. Unsloth rewrites the LoRA forward and backward passes in Triton (OpenAI's language for writing fast GPU kernels). Result, according to the project: the same loss curves, about 2× faster, with about 50% less VRAM. It is a drop-in for Llama, Mistral, Phi, Gemma, and Qwen: change a couple of import lines and it works. Their notebooks are a good starting point if you have never fine-tuned.

Try it

pip install unsloth
# Then open one of the Unsloth notebooks for your model family and run it top to bottom
Source
DeepSpeed ZeRO-3 / PyTorch FSDP — train models too big for one GPU

Splits a giant model across many GPUs during training. The way teams fine-tune 70B+ models on 8 cards.

Training · 8× A100 80GB or H100s

When you train a model, you need memory for more than the weights: gradients, optimizer state, and activations together take about 4× the model itself. ZeRO is a method that shards (splits) the optimizer state and gradients across all your GPUs so each card only stores a slice. ZeRO-3 also shards the model parameters themselves. Combined with mixed-precision (bf16) training and activation checkpointing, and often with optimizer state offloaded to CPU memory, this lets a standard 8× A100 box fine-tune a 70B model. PyTorch FSDP is the in-tree alternative built on the same idea.
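The sharding arithmetic can be sketched in a few lines. This assumes the usual mixed-precision Adam accounting from the ZeRO paper: 16 bytes per parameter, which is 4× a 4-byte fp32 copy of the weights. Activations are left out, and the 7B model size and 8 GPUs are illustrative numbers only.

```python
# Bytes per parameter for mixed-precision Adam training (activations not included):
# 2 (16-bit weights) + 2 (16-bit gradients) + 12 (fp32 master weights, momentum, variance)
WEIGHT_BYTES, GRAD_BYTES, OPTIMIZER_BYTES = 2, 2, 12

def per_gpu_gb(num_params, num_gpus, zero_stage):
    """Approximate model-state memory per GPU in GB (1 GB = 1e9 bytes)."""
    weights, grads, optim = WEIGHT_BYTES, GRAD_BYTES, OPTIMIZER_BYTES
    if zero_stage >= 1:
        optim /= num_gpus      # stage 1 shards the optimizer state
    if zero_stage >= 2:
        grads /= num_gpus      # stage 2 also shards the gradients
    if zero_stage >= 3:
        weights /= num_gpus    # stage 3 also shards the parameters
    return num_params * (weights + grads + optim) / 1e9

# Illustrative: a 7B-parameter model on 8 GPUs, stage 0 (no sharding) through stage 3.
for stage in range(4):
    print(f"ZeRO-{stage}: {per_gpu_gb(7e9, 8, stage):.1f} GB per GPU")
```

Each stage shards one more piece, so the per-GPU figure drops step by step. Real runs need extra headroom for activations and temporary buffers.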

Try it

deepspeed --num_gpus 8 train.py --deepspeed ds_config_zero3.json
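The command above expects a config file. Below is a minimal sketch that writes one from Python. The keys are standard DeepSpeed config options, but the values (batch size, accumulation steps, CPU offload) are assumptions to tune for your hardware, and train.py stands for your own training script.

```python
import json

# Illustrative ZeRO-3 settings; adjust batch sizes and offload for your hardware.
ds_config = {
    "bf16": {"enabled": True},                  # mixed-precision training
    "zero_optimization": {
        "stage": 3,                             # shard optimizer state, gradients, and parameters
        "overlap_comm": True,                   # overlap communication with computation
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # optional: keep optimizer state in CPU RAM
        "stage3_gather_16bit_weights_on_model_save": True,           # save a regular 16-bit checkpoint
    },
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
}

with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

If your training script sets its own batch size or accumulation steps, keep them in agreement with the values in this file.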
Source
QLoRA — fine-tune a 65B model on a single GPU

Teach a large model new skills on a single GPU: about 48 GB for a 65B model, instead of the 780+ GB that full 16-bit fine-tuning needs.

Training · RTX 3090 / 4090 24GB (33B) or one 48GB GPU (65B)

Fine-tuning means starting from an already-trained model and teaching it your own data. According to the QLoRA paper, regular 16-bit fine-tuning of a 65B-parameter model needs more than 780 GB of GPU memory (VRAM), far more than any single card has. QLoRA does two clever things: it stores the original model in 4-bit numbers (about 4× smaller than 16-bit), and it trains only a tiny add-on called a LoRA adapter, not the whole model. With that, the paper fine-tunes a 65B model on a single 48 GB GPU and a 33B model on a 24 GB gaming card, while matching the quality of 16-bit fine-tuning on its benchmarks. Result: home-lab fine-tuning becomes possible.
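Both tricks come down to simple arithmetic, sketched below. The model size (8B), matrix shape (4096 × 4096), and LoRA rank (16) are illustrative assumptions, and the weight estimate ignores the small overhead of quantization constants.

```python
def weight_gb(num_params, bits):
    """Memory for the frozen base weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix:
    a (d_out x rank) matrix plus a (rank x d_in) matrix."""
    return rank * (d_in + d_out)

# Trick 1: 4-bit storage. Illustrative 8B-parameter model.
full_16bit = weight_gb(8e9, 16)
four_bit = weight_gb(8e9, 4)

# Trick 2: train only the adapter. Illustrative 4096 x 4096 layer with rank 16.
adapter = lora_params(4096, 4096, 16)
fraction = adapter / (4096 * 4096)

print(f"weights: {full_16bit:.0f} GB in 16-bit, {four_bit:.0f} GB in 4-bit")
print(f"adapter: {adapter} trainable parameters, {fraction:.2%} of the layer")
```

Because so few parameters are trainable, their gradients and optimizer state are small too. That is where most of the remaining savings come from.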

Try it

pip install bitsandbytes peft transformers
# Then, in your own training script, load the model in 4-bit with transformers' BitsAndBytesConfig and attach an adapter with peft's LoraConfig
Source
Boost Mixture-of-Experts training throughput with advanced fusion kernels · New · Auto

Increase training throughput for Mixture-of-Experts (MoE) models using advanced fusion kernels

Training · Any GPU

An NVIDIA blog post describes how advanced fusion kernels can raise the training throughput of Mixture-of-Experts (MoE) models, a key component of large-scale AI systems, by optimizing the communication between experts. A fused kernel combines several GPU operations into one launch, which cuts memory traffic and launch overhead.

Try it

# Placeholder command, actual usage depends on the kernels and framework described in the source post
Source
Fine-tune biological foundation models with LoRA using NVIDIA BioNeMo · New · Auto

Use NVIDIA BioNeMo recipes to fine-tune foundation models with LoRA for computational biology tasks

Training · Any GPU

NVIDIA BioNeMo provides recipes for fine-tuning large biological foundation models such as ESM2 with LoRA, so they can be adapted to specific computational biology tasks without updating every weight.

Try it

# Placeholder command, actual usage depends on the BioNeMo recipe described in the source post
Source
Train models faster with JAX and MaxText using NVFP4 on NVIDIA Blackwell · New · Auto

Improve throughput when training large language models with JAX and MaxText on NVIDIA Blackwell

Training · NVIDIA Blackwell

When training spans trillions of tokens across thousands of accelerators, every percentage point of step-time improvement matters. Running MaxText (an LLM training codebase built on JAX) with NVFP4, NVIDIA's 4-bit floating-point format for Blackwell GPUs, can raise throughput, which matters most when pre-training frontier LLMs.

Try it

# Placeholder command, actual usage depends on the MaxText configuration described in the source post
Source
Post-train autonomous vehicle models in closed-loop with NVIDIA Alpamayo · New · Auto

Use NVIDIA Alpamayo to bridge the gap between training and deployment for AV policies

Training · NVIDIA GPU

NVIDIA Alpamayo helps developers post-train autonomous vehicle (AV) models in a closed loop, where the policy's own actions determine what it sees next. This refines and validates AV policies before deployment, so they hold up in real-world scenarios.

Try it

# Placeholder command, actual usage depends on the Alpamayo tooling described in the source post
Source
Synthesize realistic 3D medical images at scale · New · Auto

Use NVIDIA's method to generate high-quality 3D medical imaging data for radiology AI

Training · RTX 3090 24GB

NVIDIA's method synthesizes realistic 3D medical images at scale, which addresses data scarcity and privacy constraints in radiology AI. Synthetic scans let teams train models on more diverse and representative datasets.

Try it

# Placeholder command, actual implementation depends on NVIDIA's tools and frameworks
Source
Properly evaluate AI agents using agentic techniques · New · Auto

Distinguish between evaluating AI models and AI agents

Training · CPU only

Evaluating an AI model and evaluating an AI agent are related but answer fundamentally different questions. A model benchmark tests the capability of a model, whereas an AI agent evaluation focuses on how well the agent performs in a specific environment or task.
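A toy sketch can make the contrast concrete. Everything below is hypothetical and exists only for illustration: the "model" is a lookup table, the "environment" is a number line, and the function names are not from any evaluation library.

```python
def model_benchmark(model, qa_pairs):
    """Model evaluation: fraction of fixed questions answered correctly in one shot."""
    correct = sum(1 for question, answer in qa_pairs if model(question) == answer)
    return correct / len(qa_pairs)

def agent_evaluation(agent, start, goal, max_steps=10):
    """Agent evaluation: let the agent act step by step and report how the episode ends."""
    position = start
    for step in range(1, max_steps + 1):
        position += agent(position, goal)   # the agent picks a move of -1, 0, or +1
        if position == goal:
            return {"success": True, "steps": step}
    return {"success": False, "steps": max_steps}

def toy_model(question):
    # Hypothetical model: knows two answers and guesses "?" otherwise.
    return {"2+2": "4", "3+3": "6"}.get(question, "?")

def toy_agent(position, goal):
    # Hypothetical agent: always moves one step toward the goal.
    return 1 if position < goal else -1 if position > goal else 0

benchmark_score = model_benchmark(toy_model, [("2+2", "4"), ("3+3", "6"), ("5+5", "10")])
episode = agent_evaluation(toy_agent, start=0, goal=3)
print(benchmark_score, episode)
```

The benchmark scores isolated answers. The agent evaluation scores a whole episode: whether the goal was reached and how many steps it took.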

Source
Fine-tune NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation · New · Auto

Use LoRA/DoRA to fine-tune NVIDIA Cosmos Predict 2.5 for robot video generation tasks

Training · RTX 3090 24GB

An NVIDIA blog post shows how to fine-tune the Cosmos Predict 2.5 model with LoRA or DoRA for robot video generation. Both methods train small adapter weights instead of the whole model, so it can be adapted to a specific use case and produce better videos for robotics and autonomous systems.

Try it

# Placeholder command, actual usage depends on the fine-tuning steps described in the source post
Source
Learn from Parameter Golf AI-assisted research techniques · Auto

Explore AI-assisted machine learning research and model design

Training · CPU only

The Parameter Golf event gathered 1,000+ participants to explore AI-assisted research, coding agents, quantization, and novel model design under strict constraints.

Try it

# Placeholder for AI-assisted research commands
Source
Utilize AWS for foundation model training and inference · Auto

AWS provides building blocks for training and inference of foundation models

Training · AWS GPU instances

AWS offers various services and tools that can be used to train and deploy foundation models efficiently. These services can help manage the complexity of large-scale model training and inference.

Try it

aws s3 sync s3://my-bucket/path/to/model /path/to/local/model
Source