AI Tips

Practical ways to train, run, or shrink AI models, explained for people new to AI. 3 new in the last 30 days.

New here? Each card answers one question: what is this and why should I care? Click a card to read the full explanation, including any new words. The command at the bottom is what you would type to try it on your own machine.

The AI flow — where each tip fits


An AI model goes through these five phases. Click a phase to see the tips that apply there.

1. Pre-training
A model first learns language by reading huge amounts of text. This costs millions of dollars and runs on thousands of GPUs.
2. Fine-tuning
You take that pre-trained model and teach it your own data, your own task, or your own writing style. Hours to days, on a few GPUs.
e.g. QLoRA · Unsloth · DeepSpeed ZeRO-3 / FSDP
See Training tips →
3. Preference tuning
After fine-tuning, you teach the model which answers humans prefer. This makes it polite, helpful, and on-topic.
e.g. DPO / GRPO / KTO
See Training tips →
4. Quantization
The trained model is huge. Quantization shrinks it about 4× by storing its numbers with less precision, so it fits on cheap hardware.
e.g. GGUF + llama.cpp · AWQ / GPTQ · EXL2
See Quantization tips →
5. Inference / serving
Running the model so users can ask it questions. This is what your app actually does in production.
e.g. vLLM (PagedAttention) · Ollama · Speculative decoding
See Inference tips →


AWQ / GPTQ: shrink a model 4× with almost no quality loss

Smarter compression that picks which numbers matter most and keeps those accurate. The model takes a quarter of the memory but answers almost the same.

Quantization · any 16GB+ GPU

Quantization (storing numbers in 4-bit instead of 16-bit) usually loses some quality. AWQ — short for Activation-aware Weight Quantization — figures out which channels in the model are most important and protects them. The result: 4-bit AWQ keeps about 99% of the quality of the full model, but uses a quarter of the memory. GPTQ is the older method and slightly weaker. AWQ also runs faster on modern NVIDIA cards because vLLM and TGI ship optimized code for it.
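The "quarter of the memory" figure follows from simple arithmetic on bits per weight. A minimal sketch in Python, assuming a 70-billion-parameter model and counting the weights only. The helper name `weight_memory_gb` is hypothetical, and real 4-bit formats also store scales and zero points for each group of weights, so actual files run somewhat larger.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # assumption: a 70B-parameter model

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # full 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB, ratio: {fp16_gb / int4_gb:.0f}x")
```

Under these assumptions the 16-bit weights come to 140 GB and the 4-bit weights to 35 GB.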

Try it

python -c "from awq import AutoAWQForCausalLM; m = AutoAWQForCausalLM.from_quantized('casperhansen/llama-3-70b-awq')"


GGUF + llama.cpp — run a 70B model on a gaming PC

A file format that packs a model so it can split between your GPU and your normal computer memory. Lets a 70B model run on a 4090.

Quantization · RTX 4090 24GB (with offload)

GGUF is the file format llama.cpp uses to ship pre-shrunk (quantized) models. Quantization means storing the model's numbers with less precision so the file is smaller; Q4_K_M is the popular setting, about 4× smaller than the original. The clever part is layer offload: you tell llama.cpp how many layers to keep on the GPU (for example, -ngl 35 keeps 35 layers there) and the rest stay in your normal computer RAM. So a 70B model that needs ~40 GB can split: up to 24 GB on a 4090 and the rest (16 GB or more) in system RAM. Expect about 8 tokens per second on a 4090 with 64 GB DDR5.
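To get a feel for how a layer count like -ngl 35 relates to the GPU/RAM split, here is a rough back-of-the-envelope sketch. It assumes all layers are the same size, that the model has 80 layers, and that 6.5 GB of VRAM is held back for the context cache and buffers. Those two numbers and the helper `layers_on_gpu` are illustrative assumptions, not values reported by llama.cpp.

```python
import math


def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float, reserve_gb: float) -> int:
    """Estimate how many layers fit in VRAM, assuming every layer is the same size."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)  # VRAM left after cache and overhead
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


MODEL_GB = 40.0   # ~40 GB figure from the text
N_LAYERS = 80     # assumption: layer count of a Llama-style 70B model
VRAM_GB = 24.0    # RTX 4090
RESERVE_GB = 6.5  # assumption: room for the context cache and buffers

ngl = layers_on_gpu(MODEL_GB, N_LAYERS, VRAM_GB, RESERVE_GB)
gpu_gb = ngl * MODEL_GB / N_LAYERS  # weights kept on the GPU
ram_gb = MODEL_GB - gpu_gb          # weights left in system RAM

print(f"-ngl {ngl}: ~{gpu_gb:.1f} GB of weights on the GPU, ~{ram_gb:.1f} GB in system RAM")
```

With these illustrative inputs the estimate lands on 35 layers, about 17.5 GB of weights on the GPU. The 24 GB figure in the text is the ceiling when the whole card holds weights; in practice some VRAM has to stay free for the context, so treat the result as a starting point and adjust -ngl until the model loads without running out of memory.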

Try it

llama-cli -m llama-3.1-70b-q4_k_m.gguf -ngl 35 -c 8192 -p "hello"


Convert FP8 Checkpoints to NVIDIA TensorRT for High-Performance Inference · New · Auto

Bridge the gap between model optimization and production deployment

Quantization · NVIDIA GPU

Converting a quantized FP8 checkpoint into an NVIDIA TensorRT engine speeds up inference, which matters when you deploy optimized models in production.

Try it

python post_training_quantization/convert_to_tensorrt.py --fp8_checkpoint /path/to/fp8/checkpoint --output /path/to/output/engine


Convert FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT · New · Auto

Bridging the gap between model optimization and production deployment

Quantization · NVIDIA GPU

Converting a quantized checkpoint into an NVIDIA TensorRT engine enables faster inference, improving model performance and reducing latency.

Try it

python -c "from torch.ao.quantization.quantize_fx import prepare_qat_fx; help(prepare_qat_fx)"


Google introduces Gemma 4 for QAT models · New · Auto

Optimizing model compression for mobile and laptop efficiency

Quantization · Mobile and laptop GPUs

Google's Gemma 4 introduces advancements in quantization-aware training (QAT), a technique that simulates low-precision numbers during training so the model loses less quality once it is quantized. This can significantly improve the efficiency of AI models on mobile and laptop devices, where computational resources are limited.
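For readers new to the idea, quantization-aware training can be illustrated with a toy example: the forward pass uses a weight snapped to a coarse grid ("fake quantization"), while the gradient updates a full-precision copy of that weight. The sketch below is a generic one-weight illustration in plain Python, not Google's recipe; the function names, the 4-bit grid, and the training data are all assumptions made for the example.

```python
def fake_quant(w: float, scale: float = 0.25, qmin: int = -8, qmax: int = 7) -> float:
    """Snap w to the nearest point on a 4-bit grid, then convert back to a float."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_qat(x: float, y: float, steps: int = 200, lr: float = 0.1) -> float:
    """Fit y ~ fake_quant(w) * x by gradient descent on the squared error."""
    w = 0.0  # full-precision "shadow" weight
    for _ in range(steps):
        pred = fake_quant(w) * x     # forward pass sees the quantized weight
        grad = 2.0 * (pred - y) * x  # straight-through: treat d fake_quant / dw as 1
        w -= lr * grad               # the update goes to the shadow weight
    return w


w = train_qat(x=1.0, y=0.7)
print(f"shadow weight {w:.3f} -> quantized weight {fake_quant(w):.2f}")
```

The target 0.7 is not on the grid, so the quantized weight can only settle on a neighboring grid point. The point of the exercise is that training already accounts for that rounding instead of discovering it after the fact.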

Try it

python train.py --quantize


Reduce VRAM usage and improve inference performance with NVIDIA Model Optimizer · Auto

Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs.

Quantization · GeForce RTX GPUs

NVIDIA Model Optimizer supports post-training quantization, which can reduce VRAM usage and improve inference performance on consumer devices. This can be particularly useful for deploying models on devices with limited resources.
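Post-training quantization, in its simplest form, rescales already-trained weights onto a small integer range and stores the integers plus one scale factor. The sketch below shows that idea with plain-Python absmax INT8 rounding. It is a generic illustration, not NVIDIA Model Optimizer's implementation; the function names and sample weights are assumptions.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.02, -1.27, 0.5, 0.731, -0.004]  # made-up sample weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print(f"scale: {scale:.4f}, worst rounding error: {max_err:.4f}")
```

Each weight now needs 8 bits instead of 16 or 32, and the rounding error per weight is bounded by half the scale. Production tools refine this with calibration data and per-channel or per-group scales.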

Try it

python -m modelopt.onnx.quantization --onnx_path=<model>.onnx --quantize_mode=<int8|fp8> --output_path=<quantized_model>.onnx


Reduce VRAM usage and improve inference performance with model quantization · Auto

Use NVIDIA Model Optimizer for post-training quantization to optimize models for consumer devices

Quantization · NVIDIA GeForce RTX GPUs

Model quantization reduces VRAM usage and improves inference performance on consumer devices such as NVIDIA GeForce RTX GPUs.

Try it

python -m modelopt.onnx.quantization --onnx_path=<model> --quantize_mode=<int8|fp8> --output_path=<quantized_model>
