AI Tips
Practical ways to train, run, or shrink AI models — explained for people new to AI. 3 new in last 30d.
New here? Each card answers one question: what is this and why should I care? Click a card to read the full explanation, including any new words. The command at the bottom is what you would type to try it on your own machine.
The AI flow — where each tip fits
Read left to right. An AI model goes through these five phases. Click a phase to see the tips that apply there.
- 1. Pre-training
A model first learns language by reading huge amounts of text. This costs millions of dollars and runs on thousands of GPUs.
- 2. Fine-tuning
You take that pre-trained model and teach it your own data, your own task, or your own writing style. Hours to days, on a few GPUs.
e.g. QLoRA · Unsloth · DeepSpeed ZeRO-3 / FSDP
See Training tips →
- 3. Preference tuning
After fine-tuning, you teach the model which answers humans prefer. This makes it polite, helpful, and on-topic.
e.g. DPO / GRPO / KTO
See Training tips →
- 4. Quantization
The trained model is huge. Quantization shrinks it about 4× by storing its numbers with less precision, so it fits on cheap hardware.
e.g. GGUF + llama.cpp · AWQ / GPTQ · EXL2
See Quantization tips →
- 5. Inference / serving
Running the model so users can ask it questions. This is what your app actually does in production.
e.g. vLLM (PagedAttention) · Ollama · Speculative decoding
See Inference tips →
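To make the "about 4×" figure from the quantization phase concrete, here is a minimal sketch of the arithmetic. It counts only the weights (no activations or cache), and the function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

full = weight_memory_gb(70e9, 16)   # a 70B model stored in 16-bit
small = weight_memory_gb(70e9, 4)   # the same model stored in 4-bit

print(f"16-bit: {full:.0f} GB, 4-bit: {small:.0f} GB, ratio: {full / small:.0f}x")
# prints: 16-bit: 140 GB, 4-bit: 35 GB, ratio: 4x
```

Real quantized files come out slightly larger than this estimate because they also store scales and other metadata.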
AWQ / GPTQ: shrink a model 4× with almost no quality loss
Smarter compression that picks which numbers matter most and keeps those accurate. The model takes a quarter of the memory but answers almost the same.
Quantization · any 16GB+ GPU
Quantization (storing numbers in 4-bit instead of 16-bit) usually loses some quality. AWQ, short for Activation-aware Weight Quantization, figures out which channels in the model are most important and protects them. The result: 4-bit AWQ keeps about 99% of the quality of the full model, but uses a quarter of the memory. GPTQ is the older method and usually scores slightly lower at the same bit width. AWQ also runs faster on modern NVIDIA cards because vLLM and TGI ship optimized code for it.
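To see what "storing numbers in 4-bit" means, here is a minimal sketch of plain round-to-nearest quantization on a small group of weights. This is the simple baseline that AWQ and GPTQ improve on, not AWQ itself; the function names and sample weights are invented for the illustration.

```python
def quantize_4bit(weights):
    """Map floats to integer codes 0..15 using one scale and offset per group."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # 16 levels means 15 steps; avoid a zero scale
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_4bit(codes, scale, lo):
    """Turn the 4-bit codes back into approximate floats."""
    return [lo + c * scale for c in codes]

weights = [-0.82, -0.31, 0.0, 0.07, 0.45, 1.10]  # made-up sample values
codes, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, lo)

# Each restored value is off by at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error = {max_error:.4f}, half step = {scale / 2:.4f}")
```

The rounding error is the quality loss the card talks about. Methods like AWQ aim to keep that error small on the weights that affect the output most.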
Try it
python -c "from awq import AutoAWQForCausalLM; m = AutoAWQForCausalLM.from_quantized('casperhansen/llama-3-70b-awq')"
GGUF + llama.cpp: run a 70B model on a gaming PC
A file format that packs a model so it can split between your GPU and your normal computer memory. Lets a 70B model run on a 4090.
Quantization · RTX 4090 24GB (with offload)
GGUF is the file format llama.cpp uses to ship pre-shrunk (quantized) models. Quantization means storing the model's numbers with less precision so the file is smaller; Q4_K_M is the popular setting, about 4× smaller than the 16-bit original. The clever part is layer offload: you tell llama.cpp how many layers to keep on the GPU (-ngl 35) and the rest stay in your normal computer RAM. So a 70B model that needs ~40 GB can split: up to 24 GB on a 4090, with the remaining 16 GB or more in system RAM. About 8 tokens per second on a 4090 with 64 GB DDR5.
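The split can be estimated before you download anything. The sketch below rests on loud assumptions: the model has 80 layers of equal size, and you hold back a few GB of VRAM for the context cache. Both numbers and both function names are hypothetical, chosen only to illustrate the idea.

```python
def offload_split(model_gb: float, n_layers: int, ngl: int):
    """Return (GB on GPU, GB in system RAM) when ngl layers go to the GPU."""
    ngl = max(0, min(ngl, n_layers))
    per_layer = model_gb / n_layers  # assumes every layer is the same size
    gpu_gb = per_layer * ngl
    return gpu_gb, model_gb - gpu_gb

def max_layers_on_gpu(model_gb: float, n_layers: int,
                      vram_gb: float, reserve_gb: float) -> int:
    """Largest -ngl value that fits after reserving VRAM for the cache."""
    per_layer = model_gb / n_layers
    usable = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable // per_layer))

print(offload_split(model_gb=40, n_layers=80, ngl=35))
# prints: (17.5, 22.5)
print(max_layers_on_gpu(model_gb=40, n_layers=80, vram_gb=24, reserve_gb=4))
# prints: 40
```

Under these assumptions, -ngl 35 puts about 17.5 GB of weights on the GPU and leaves headroom on a 24 GB card for the context. Raising -ngl moves more layers to the GPU until VRAM runs out.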
Try it
llama-cli -m llama-3.1-70b-q4_k_m.gguf -ngl 35 -c 8192 -p "hello"
Convert FP8 Checkpoints to NVIDIA TensorRT for High-Performance Inference
New · Auto
Bridge the gap between model optimization and production deployment
Quantization · NVIDIA GPU
Converting a quantized FP8 checkpoint into an NVIDIA TensorRT engine makes inference faster, which matters when you deploy an optimized model in production.
Try it
python post_training_quantization/convert_to_tensorrt.py --fp8_checkpoint /path/to/fp8/checkpoint --output /path/to/output/engine
Convert FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT
New · Auto
Bridging the gap between model optimization and production deployment
Quantization · NVIDIA GPU
Converting a quantized checkpoint into an NVIDIA TensorRT engine speeds up inference and reduces latency.
Try it
trtllm-build --checkpoint_dir /path/to/fp8/checkpoint --output_dir /path/to/output/engine
Google introduces Gemma 4 for QAT models
New · Auto
Optimizing model compression for mobile and laptop efficiency
Quantization · Mobile and laptop GPUs
Google's Gemma 4 introduces advancements in quantization-aware training (QAT). QAT simulates low-precision numbers while the model is still training, so the model learns to tolerate the rounding and loses less quality once it is quantized. This can make models noticeably more efficient on mobile and laptop devices, where computational resources are limited.
Try it
python train.py --quantize
Reduce VRAM usage and improve inference performance with NVIDIA Model Optimizer
Auto
Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs.
Quantization · GeForce RTX GPUs
NVIDIA Model Optimizer supports post-training quantization, which can reduce VRAM usage and improve inference performance on consumer devices. This can be particularly useful for deploying models on devices with limited resources.
Try it
python -m modelopt.onnx.quantization --onnx_path <model>.onnx --quantize_mode int8 --output_path <quantized_model>.onnx
Reduce VRAM usage and improve inference performance with model quantization
Auto
Use NVIDIA Model Optimizer for post-training quantization to optimize models for consumer devices
Quantization · NVIDIA GeForce RTX GPUs
Model quantization reduces VRAM usage and improves inference performance on consumer devices such as NVIDIA GeForce RTX GPUs.
Try it
python -m modelopt.onnx.quantization --onnx_path <model>.onnx --quantize_mode int8 --output_path <quantized_model>.onnx