AI Tips
Practical ways to train, run, or shrink AI models — explained for people new to AI. 13 new in last 30d.
New here? Each card answers one question: what is this and why should I care? Click a card to read the full explanation, including any new words. The command at the bottom is what you would type to try it on your own machine.
The AI flow — where each tip fits
Read left to right. An AI model goes through these five phases. Click a phase to see the tips that apply there.
1. Pre-training
A model first learns language by reading huge amounts of text. This costs millions of dollars and runs on thousands of GPUs.
2. Fine-tuning
You take that pre-trained model and teach it your own data, your own task, or your own writing style. Hours to days, on a few GPUs.
e.g. QLoRA · Unsloth · DeepSpeed ZeRO-3 / FSDP
See Training tips →
3. Preference tuning
After fine-tuning, you teach the model which answers humans prefer. This makes it polite, helpful, and on-topic.
e.g. DPO / GRPO / KTO
See Training tips →
4. Quantization
The trained model is huge. Quantization shrinks it about 4× by storing its numbers with less precision, so it fits on cheap hardware.
e.g. GGUF + llama.cpp · AWQ / GPTQ · EXL2
See Quantization tips →
5. Inference / serving
Running the model so users can ask it questions. This is what your app actually does in production.
e.g. vLLM (PagedAttention) · Ollama · Speculative decoding
See Inference tips →
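To make the quantization phase concrete, here is a minimal sketch of symmetric 4-bit rounding in plain Python. The function names and the single shared scale are illustrative assumptions; real formats such as GGUF, AWQ, or GPTQ use per-block scales and smarter rounding, so treat this as a picture of the idea rather than how those tools work internally.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-7, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


def size_ratio(bits_before=16, bits_after=4):
    """How many times smaller the weights get, ignoring the small cost of scales."""
    return bits_before / bits_after


weights = [0.7, -0.3, 0.12, 0.0, -0.7]   # made-up example values
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
print(codes)         # small integers that fit in 4 bits each
print(restored)      # close to the originals, but not identical
print(size_ratio())  # 16-bit to 4-bit storage
```

The restored values differ slightly from the originals; that rounding error is the price paid for the smaller size.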
DPO / GRPO / KTO: teach a model what good looks like
Modern ways to use human feedback to make a model prefer good answers over bad ones. Simpler than the old RLHF setup.
Training · 1× A100 80GB or QLoRA on 24GB
Classic RLHF (the method behind ChatGPT) trains a separate reward model first, then runs reinforcement learning. It works but it is complicated. DPO (Direct Preference Optimization) skips the reward model: you give it pairs of (good answer, bad answer) and it adjusts the model directly. GRPO is a lighter reinforcement-learning method: it samples a group of answers per prompt and rewards the ones that score above the group average, which suits math and reasoning where correctness can be verified automatically (DeepSeek used it for their math model). KTO needs only single labels (was this answer good? yes/no), so you can use cheap data like in-app thumbs-up/down. All three are in the trl library.
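As a rough illustration of what "directly adjusts the model" means, the sketch below computes the DPO loss for a single (good answer, bad answer) pair in plain Python. The function name and the example log-probabilities are assumptions made for illustration; in practice the trl library computes this over batches of token log-probabilities for you.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of answers.

    Each argument is the summed log-probability of a full answer, under
    either the model being trained (policy) or a frozen reference copy.
    """
    # How much more the policy likes each answer than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities: the policy starts identical to the reference,
# then shifts probability toward the chosen answer.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # no preference learned yet
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))   # lower: policy now favors the good answer
```

The loss falls as the model raises the good answer's probability relative to the bad one, measured against the reference model, which is why no separate reward model is needed.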
Try it
pip install trl && trl dpo --model_name_or_path meta-llama/Llama-3.1-8B --dataset_name HuggingFaceH4/ultrafeedback_binarized --output_dir llama-3.1-8b-dpo

Unsloth: 2× faster LoRA fine-tuning, half the VRAM
A drop-in library that makes fine-tuning twice as fast and uses half the GPU memory. Same result, less waiting.
Training · RTX 3090 / 4090 24GB
Hugging Face's PEFT library is fine, but it is written in pure PyTorch, which leaves performance on the table. Unsloth rewrites the LoRA forward and backward passes in Triton (OpenAI's GPU kernel language). Result: the same loss curves, about 2× faster, with about 50% less VRAM. It is a drop-in for Llama, Mistral, Phi, Gemma, and Qwen: change a couple of import lines and it works. Their notebooks are a good starting point if you have never fine-tuned.
Try it
pip install unsloth  # then start from one of Unsloth's Llama 3 fine-tuning notebooks

DeepSpeed ZeRO-3 / PyTorch FSDP: train models too big for one GPU
Splits a giant model across many GPUs during training. The way teams fine-tune 70B+ models on 8 cards.
Training · 8× A100 80GB or H100s
When you train a model, you also need memory for gradients, optimizer state, and activations, together about 4× the model itself. ZeRO shards (splits) the optimizer state and gradients across all your GPUs so each card stores only a slice. ZeRO-3 also shards the model parameters themselves. Combined with mixed-precision (bf16) training and activation checkpointing, and often with some optimizer state offloaded to CPU memory, this lets a normal 8× A100 box train a 70B model. PyTorch FSDP is the in-tree alternative built on the same idea.
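The effect of sharding can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculator, not DeepSpeed code: the function name is made up, the byte counts (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 Adam state) are a common accounting assumption for mixed-precision training, activations are ignored, and the 7B model on 8 GPUs is just an example.

```python
def zero_memory_per_gpu_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory (GB) for weights, gradients, and Adam state.

    Assumes 2 bytes/param for bf16 weights, 2 for bf16 gradients, and
    12 for fp32 Adam state (master weights, momentum, variance).
    Activations are not counted.
    """
    weights = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage >= 1:
        optim /= num_gpus      # stage 1: shard optimizer state
    if stage >= 2:
        grads /= num_gpus      # stage 2: also shard gradients
    if stage >= 3:
        weights /= num_gpus    # stage 3: also shard parameters
    return (weights + grads + optim) / 1e9


# Hypothetical example: a 7B-parameter model on 8 GPUs.
for stage in range(4):
    print(f"ZeRO stage {stage}: {zero_memory_per_gpu_gb(7e9, 8, stage):.2f} GB per GPU")
```

Under these assumptions each stage removes another replicated copy from every card, and only at stage 3 does per-GPU memory scale down fully with the number of GPUs.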
Try it
deepspeed --num_gpus 8 train.py --deepspeed ds_config_zero3.json

QLoRA: fine-tune a 70B model on a single GPU
Teach a giant model new skills on one 48 GB card, or a model of about 30B on a 24 GB gaming card, instead of the 320 GB you would normally need.
Training · RTX 3090 / 4090 24GB (up to about 30B) or 1× 48GB (70B)
Fine-tuning means starting from an already-trained model and teaching it your own data. A 70B-parameter model normally needs about 320 GB of GPU memory (VRAM) to fine-tune, which means renting a multi-GPU server in the cloud. QLoRA does two clever things: it stores the original model in 4-bit numbers (about 4× smaller than 16-bit), and it only trains a tiny add-on called a LoRA adapter, not the whole model. At 4 bits, the weights of a 70B model take about 35 GB, so the full setup fits on a single 48 GB card, and models up to roughly 30B fit on a 24 GB gaming card, with little loss in quality. Result: home-lab fine-tuning is suddenly possible.
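To show why the LoRA adapter is "tiny", here is a small plain-Python sketch of a LoRA layer. The helper names and the toy matrices are illustrative assumptions, and the 4096 × 4096 layer with rank 16 is only an example size; real code would use the peft library on GPU tensors.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (frozen weights in the full layer, trainable weights in the adapter)."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, adapter


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


full, adapter = lora_param_counts(4096, 4096, 16)
print(full, adapter, full // adapter)   # the adapter is a small fraction of the layer

W = [[1, 0, 2], [0, 1, 0]]              # frozen base weights (2 x 3)
A = [[0.5, 0.5, 0.5]]                   # rank-1 adapter, input side
B_zero = [[0], [0]]                     # B starts at zero: the model is unchanged
B_trained = [[1], [0]]                  # after training, B nudges the output
x = [1, 2, 3]
print(lora_forward(W, A, B_zero, x, alpha=2, rank=1))
print(lora_forward(W, A, B_trained, x, alpha=2, rank=1))
```

Because only A and B receive gradients, the optimizer state is also tiny, which is where most of the memory saving during training comes from.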
Try it
pip install bitsandbytes peft transformers  # then load meta-llama/Llama-3.1-70B with BitsAndBytesConfig(load_in_4bit=True) and attach a LoRA adapter with peft

Boost Mixture-of-Experts training throughput with advanced fusion kernels
New · Auto
Increase training throughput for Mixture-of-Experts models using advanced fusion kernels.
Training · Any GPU
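Fused kernels combine several GPU operations into a single launch, which cuts memory traffic and launch overhead. Applied to Mixture-of-Experts (MoE) models, a key component in large-scale AI systems, advanced fusion kernels can significantly boost training throughput, including by optimizing the communication between experts.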
Try it
# Placeholder: MixtureOfExpertsModel and advanced_fusion_kernels are hypothetical names, not a real library API
moe_model = MixtureOfExpertsModel()
advanced_fusion_kernels.optimize_training_throughput(moe_model)

Fine-tune biological foundation models with LoRA using NVIDIA BioNeMo
New · Auto
Use NVIDIA BioNeMo recipes to fine-tune biological foundation models with LoRA.
Training · Any GPU
NVIDIA BioNeMo provides recipes for fine-tuning biological foundation models with LoRA. LoRA trains a small set of adapter weights instead of the whole model, so these large models can be adapted to computational biology tasks at much lower cost.
Try it
# Placeholder: NVIDIABioNeMoRecipes and fine_tune_biological_model are hypothetical names, not the BioNeMo API
nvidia_bionemo_recipes = NVIDIABioNeMoRecipes()
lora_finetuned_model = nvidia_bionemo_recipes.fine_tune_biological_model(model, data)

Boost Mixture-of-Experts training throughput with advanced fusion kernels
New · Auto
Increase MoE model training throughput using advanced fusion kernels.
Training · RTX 3090 24GB
NVIDIA's blog post discusses how to boost the training throughput of Mixture-of-Experts (MoE) models by using advanced fusion kernels, which can significantly improve the efficiency of training large-scale AI systems.
Try it
# Placeholder command: the moe_train module and its flag are unverified
python -m moe_train --advanced-fusion-kernels

Fine-tune biological foundation models with LoRA using NVIDIA BioNeMo
New · Auto
Use NVIDIA BioNeMo recipes to fine-tune foundation models with LoRA for computational biology tasks.
Training · RTX 3090 24GB
NVIDIA BioNeMo provides recipes for fine-tuning large foundation models like ESM2 using LoRA, allowing for efficient and effective updates to these models for specific computational biology tasks.
Try it
# Placeholder command: the biodemo module name is unverified
python -m biodemo.lora_finetune --config-file config.yaml

Train models faster with JAX and MaxText using NVFP4 on NVIDIA Blackwell
New · Auto
Improve throughput when training large language models with JAX and MaxText on NVIDIA Blackwell.
Training · NVIDIA Blackwell
When training spans trillions of tokens across thousands of accelerators, every percentage point of step improvement matters. Using JAX and MaxText with NVFP4 on NVIDIA Blackwell can help achieve better throughput, which is crucial for pre-training frontier LLMs.
Try it
# Placeholder, not a runnable command: jax.run is not a JAX function.
# Replace it with the actual MaxText training entry point.
jax.run(your_model_training_function)

Train models faster with JAX and MaxText using NVFP4 on NVIDIA Blackwell
New · Auto
Improve throughput when training large language models with JAX and MaxText on NVIDIA Blackwell.
Training · NVIDIA Blackwell
When training spans trillions of tokens across thousands of accelerators, every percentage point of step improvement matters. Using JAX with MaxText and NVFP4 on NVIDIA Blackwell can significantly improve throughput, leading to faster training times for large language models.
Try it
# Placeholder, not a runnable command: jax.run and the max_text / nvfp4 arguments are not part of JAX
jax.run(your_model, your_data, max_text=True, nvfp4=True)

Post-train autonomous vehicle models in closed-loop with NVIDIA Alpamayo
New · Auto
Use NVIDIA Alpamayo to bridge the gap between training and deployment for AV policies.
Training · NVIDIA GPU
NVIDIA Alpamayo helps post-train autonomous vehicle (AV) models in closed loop, meaning the policy's own actions feed back into what it sees next, which is crucial for developing effective AV policies. It can be used to fine-tune and validate models before deployment.
Try it
# Placeholder command: the nvidia-alpamayo CLI name and flags are unverified
nvidia-alpamayo --train-model --input-data <data>

Post-train AV models in closed-loop with NVIDIA Alpamayo
New · Auto
Use NVIDIA Alpamayo for post-training AV models to bridge the gap between training and deployment.
Training · NVIDIA GPU
NVIDIA Alpamayo is designed to help developers post-train autonomous vehicle models in a closed-loop system, which is crucial for refining AV policies and ensuring they perform well in real-world scenarios.
Try it
# Example command for post-training with NVIDIA Alpamayo
# This is a placeholder command and may vary based on actual usage
alpamayo_post_train --model <model_path> --data <data_path>

Synthesize realistic 3D medical images at scale
New · Auto
Use NVIDIA's method to generate high-quality 3D medical imaging data for radiology AI.
Training · RTX 3090 24GB
NVIDIA's method allows for the synthesis of realistic 3D medical images at scale, addressing data scarcity and privacy issues in radiology AI. This can be crucial for training AI models on diverse and representative datasets.
Try it
# Placeholder command, actual implementation depends on NVIDIA's tools and frameworks

Properly evaluate AI agents using agentic techniques
New · Auto
Distinguish between evaluating AI models and AI agents.
Training · CPU only
Evaluating an AI model and evaluating an AI agent are related but answer fundamentally different questions. A model benchmark tests the capability of a model, whereas an AI agent evaluation focuses on how well the agent performs in a specific environment or task.
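A toy contrast may help here. In the sketch below, everything is hypothetical: the "model" is a lookup table, the "agent" is a one-line rule, and `CounterEnv` is a made-up environment. It is only meant to show the shape of the two measurements, a score on fixed questions versus a success rate over multi-step episodes.

```python
def model_benchmark(model, dataset):
    """Model evaluation: fraction of fixed questions answered correctly."""
    correct = sum(1 for question, answer in dataset if model(question) == answer)
    return correct / len(dataset)


class CounterEnv:
    """Toy environment: the agent must move a counter from `start` to `goal`."""

    def __init__(self, start, goal):
        self.state, self.goal = start, goal

    def step(self, action):
        self.state += action              # action is +1 or -1
        return self.state, self.state == self.goal


def agent_evaluation(agent, envs, max_steps=10):
    """Agent evaluation: fraction of episodes where the task is completed in time."""
    successes = 0
    for env in envs:
        state, done = env.state, env.state == env.goal
        for _ in range(max_steps):
            if done:
                break
            state, done = env.step(agent(state, env.goal))
        successes += int(done)
    return successes / len(envs)


def toy_model(question):
    return {"2+2": "4", "3+3": "6"}.get(question, "?")


def toy_agent(state, goal):
    return 1 if state < goal else -1


print(model_benchmark(toy_model, [("2+2", "4"), ("3+3", "6"), ("5+5", "10")]))
print(agent_evaluation(toy_agent, [CounterEnv(0, 3), CounterEnv(5, 2), CounterEnv(0, 50)]))
```

The agent score depends on the environment and the step budget as much as on the decision rule, which is why an agent evaluation cannot be replaced by a model benchmark.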
Source

Mastering Agentic Techniques for AI Agent Evaluation
New · Auto
Understand the difference between evaluating AI models and AI agents.
Training · CPU only
Evaluating an AI model tests its capability, while evaluating an AI agent measures how well it performs in a specific environment or task. Developers need to keep this distinction in mind when assessing their AI systems.
Source

Fine-tune NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation
New · Auto
Use LoRA/DoRA for fine-tuning NVIDIA Cosmos Predict 2.5 for robot video generation tasks.
Training · RTX 3090 24GB
This blog post details how to fine-tune NVIDIA's Cosmos Predict 2.5 model using LoRA/DoRA for robot video generation tasks. Fine-tuning allows the model to adapt to specific use cases, improving its performance on tasks like video generation for robotics.
Try it
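# Placeholder sketch: the model ID and the causal-LM loading path are unverified assumptions,
# and this runs a single text forward pass rather than LoRA/DoRA fine-tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('nvidia/cosmos-predict-2.5')
tokenizer = AutoTokenizer.from_pretrained('nvidia/cosmos-predict-2.5')
with torch.no_grad():
    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits

Fine-tune NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation
New · Auto
Use LoRA/DoRA to fine-tune NVIDIA Cosmos Predict 2.5 for improved robot video generation.
Training · RTX 3090 24GB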
In this blog post, NVIDIA demonstrates how to fine-tune their Cosmos Predict 2.5 model using LoRA/DoRA for generating robot videos. This approach can potentially improve the quality and accuracy of generated videos, which is crucial for applications in robotics and autonomous systems.
Try it
# Placeholder command: the fine_tune.py script and its flags are unverified
python fine_tune.py --model cosmos-predict-2.5 --strategy lora-dora

Learn from Parameter Golf AI-assisted research techniques
Auto
Explore AI-assisted machine learning research and model design.
Training · CPU only
The Parameter Golf event gathered 1,000+ participants to explore AI-assisted research, coding agents, quantization, and novel model design under strict constraints.
Try it
# Placeholder for AI-assisted research commands

Utilize AWS for foundation model training and inference
Auto
AWS provides building blocks for training and inference of foundation models.
Training · AWS GPU instances
AWS offers various services and tools that can be used to train and deploy foundation models efficiently. These services can help manage the complexity of large-scale model training and inference.
Try it
aws s3 sync s3://my-bucket/path/to/model /path/to/local/model