AI Tips

Practical ways to train, run, or shrink AI models — explained for people new to AI. 10 new in last 30d.

New here? Each card answers one question: what is this and why should I care? Click a card to read the full explanation, including any new words. The command at the bottom is what you would type to try it on your own machine.

The AI flow — where each tip fits

Read left to right

An AI model goes through these five phases. Click a phase to see the tips that apply there.

  1. 1Pre-training

    A model first learns language by reading huge amounts of text. This costs millions of dollars and runs on thousands of GPUs.

  2. 2Fine-tuning

    You take that pre-trained model and teach it your own data, your own task, or your own writing style. Hours to days, on a few GPUs.

    e.g. QLoRA · Unsloth · DeepSpeed ZeRO-3 / FSDP

    See Training tips →
  3. 3Preference tuning

    After fine-tuning, you teach the model which answers humans prefer. This makes it polite, helpful, and on-topic.

    e.g. DPO / GRPO / KTO

    See Training tips →
  4. 4Quantization

    The trained model is huge. Quantization shrinks it about 4× by storing its numbers with less precision, so it fits on cheap hardware.

    e.g. GGUF + llama.cpp · AWQ / GPTQ · EXL2

    See Quantization tips →
  5. 5Inference / serving

    Running the model so users can ask it questions. This is what your app actually does in production.

    e.g. vLLM (PagedAttention) · Ollama · Speculative decoding

    See Inference tips →
Pick the right serving stack — vLLM vs TGI vs Aphrodite

Three popular ways to host a model behind an API. vLLM is the safe default. TGI fits Hugging Face shops. Aphrodite is for heavy sampling.

Inferenceproduction GPU node

A serving stack is the program that turns a model into a web API. vLLM has the best throughput and an OpenAI-compatible API — it should be the default for new deployments. TGI (Text Generation Inference) is Hugging Face's, integrates nicely with their Inference Endpoints, and supports speculative decoding out of the box. Aphrodite is a vLLM fork tuned for high-temperature sampling, multi-LoRA, and role-play workloads. If you are unsure: start with vLLM.

Try it

# vLLM (most teams)
vllm serve mistralai/Mistral-Nemo-Instruct-2407
# TGI (HF infra)
docker run --gpus all ghcr.io/huggingface/text-generation-inference --model-id mistralai/Mistral-Nemo-Instruct-2407
Source
vLLM — serve a model 5–10× faster than the basic library

A serving engine that handles many requests at the same time without wasting GPU memory. The default choice for production.

Inference1× A100 / H100 / 4090

When a model answers, it has to remember what it has already written — this is called the KV cache and it eats a lot of GPU memory. The basic Hugging Face library reserves a fixed chunk for each request, which leaves a lot of memory unused. vLLM borrows the idea of virtual memory from operating systems: it splits the cache into small pages and reuses them, so the same GPU can serve 5–10× more requests at the same time. It also exposes an OpenAI-compatible HTTP API, so any client that talks to OpenAI can talk to vLLM with one URL change.

Try it

pip install vllm && vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 1
Source
EXL2 — the fastest way to run shrunk models on a gaming GPU

Picks the best precision for each part of the model. About 2× faster than the GGUF format on a single RTX card.

InferenceRTX 3090 / 4090 / 5090

EXL2 is a quantization format made for ExLlamaV2, a runtime built only for NVIDIA GPUs (no CPU offload). It is smarter than the others because it spends more bits on the important parts of the model and fewer bits on the rest, hitting an average target like 4.0 bits per weight. On one RTX 4090 a Llama-3 70B in EXL2 4.0bpw runs at about 20 tokens per second — roughly twice the speed of an equivalent GGUF setup. Use this when you need pure speed on a single card.

Try it

pip install exllamav2 && python examples/inference.py -m models/llama-3-70b-exl2-4.0bpw -p 'hello'
Source
Shift test-time inference to edge to reduce latency and bandwidthNewAuto

Move inference from centralized data centers to edge devices to improve performance

InferenceCPU only

Legacy infrastructure was not designed for AI requirements. Test-time inference is rapidly shifting to the edge to reduce latency and bandwidth consumption. This shift can significantly improve performance for applications requiring real-time responses.

Try it

sudo apt-get install edge-device-software
Source
Simplify AI infrastructure at the edge with Cisco and CanonicalNewAuto

Test-time inference is shifting to the edge to reduce latency and bandwidth consumption

InferenceCPU only

Legacy infrastructure was not designed for AI era requirements. Large-scale model training remains centralized in data centers, but test-time inference is moving to the edge to reduce latency and bandwidth consumption.

Try it

# Example command to deploy AI model at the edge
sudo apt-get install -y edge-ai-model
edge-ai-model --deploy
Source
Optimize AI inference at the edge to reduce latency and bandwidthNewAuto

Shifting test-time inference to the edge can enhance performance

InferenceCPU only

As large-scale model training remains centralized in data centers, test-time inference is rapidly shifting to the edge to reduce latency and bandwidth consumption, which is crucial for real-time AI applications.

Try it

# Example command to deploy model inference at the edge
# This is a placeholder and actual command depends on the specific edge device and software stack
edge_deploy_model.sh
Source
GPU-accelerated video decoding with Vulkan in FirefoxNewAuto

Firefox now supports Vulkan Video for improved video decoding performance

InferenceVulkan compatible GPU

Mozilla Firefox has merged initial support for Vulkan Video, which allows for GPU-accelerated video decoding. This can lead to performance improvements and reduced CPU usage when playing videos in the browser, especially on systems with Vulkan-compatible GPUs.

Try it

# Enable Vulkan Video decoding in Firefox
# This may require enabling specific settings or flags in the browser
Source
Fast Startup for Inference Workloads on Kubernetes with NVIDIA DynamoNewAuto

Use NVIDIA Dynamo for fast startup of inference workloads on Kubernetes.

InferenceAny NVIDIA GPU

NVIDIA Dynamo provides fast startup for inference workloads on Kubernetes, addressing the cold-start problem in production inference deployments where demand fluctuates over time.

Try it

kubectl apply -f nvidia-dynamo.yaml
Source
Use NVIDIA Dynamo for fast inference workloads on KubernetesNewAuto

Accelerate inference workloads on Kubernetes with NVIDIA Dynamo.

InferenceNVIDIA GPU

NVIDIA Dynamo provides fast startup for inference workloads on Kubernetes, addressing the cold-start problem in production inference deployments where demand fluctuates over time.

Try it

N/A
Source
Use NVIDIA Dynamo for fast startup of inference workloads on KubernetesNewAuto

NVIDIA Dynamo reduces inference workload startup times on Kubernetes

InferenceAny NVIDIA GPU

NVIDIA Dynamo Snapshot is designed to address the cold-start problem in inference deployments by reducing the time it takes to spin up inference replicas. This can be particularly beneficial for fluctuating demand scenarios, allowing for more efficient scaling of inference services.

Try it

kubectl apply -f <your_inference_service>
Source
NVIDIA Dynamo Snapshot for fast inference workloads on KubernetesNewAuto

Optimize inference workloads on Kubernetes for fluctuating demand

InferenceNVIDIA GPU

NVIDIA Dynamo Snapshot addresses the cold-start problem in production inference deployments by enabling fast startup for inference workloads on Kubernetes, allowing inference replicas to scale elastically and handle demand fluctuations efficiently.

Try it

kubectl apply -f <dynamo-snapshot-config>
Source
Develop web apps with local LLM inferenceNewAuto

Use local inference to avoid costs and latency of metered AI APIs

InferenceCPU only

Canonical's approach to developing web apps with local LLM inference can help reduce costs and latency associated with metered AI APIs. This enables rapid iteration during development without incurring high costs.

Try it

llama-serve --model <model_path> --port <port>
Source
Develop web apps with local LLM inference to save costsNewAuto

Use local LLM inference for web app development to avoid metered API costs and enable rapid iteration.

InferenceCPU only

Working with metered AI APIs can be costly and slow down development. By using local LLM inference, developers can iterate rapidly and avoid high costs associated with API calls. This approach is particularly beneficial for web app development where rapid iteration is crucial.

Try it

# Example Python code for local LLM inference
from transformers import pipeline

# Load the local LLM model
llm = pipeline('llm', model='local-model-name')

# Perform inference
result = llm("Your input text")
print(result)
Source
Optimize AI model serving by reducing pipeline frictionAuto

Improve model serving efficiency by optimizing the pipeline

InferenceRTX 3090 24GB

Invest weeks in fine-tuning models and discover exporting to production is not smooth. Use NVIDIA's techniques to optimize the serving pipeline and reduce friction.

Try it

# NVIDIA's model optimization and serving tips
# Placeholder for specific commands
Source
Speed up Unreal Engine NNE inference with NVIDIA TensorRTAuto

Utilize NVIDIA TensorRT to accelerate neural network inference in Unreal Engine

InferenceRTX 3090 24GB

NVIDIA TensorRT is used to speed up neural network execution in Unreal Engine, which can be beneficial for applications requiring real-time inference, such as in gaming or computer graphics.

Try it

# Example command to integrate TensorRT with Unreal Engine NNE inference
# Please refer to NVIDIA's documentation for specific implementation details
# This is a placeholder and not a direct command
tensorrt_command
Source
Speed up Unreal Engine NNE inference with NVIDIA TensorRTAuto

NVIDIA TensorRT optimizes neural network inference for RTX runtime in Unreal Engine

InferenceRTX 3090 24GB

NVIDIA's TensorRT can be used to accelerate neural network inference in Unreal Engine, particularly for RTX runtime. This can significantly boost image quality and performance in computer graphics applications, making it suitable for high-performance graphics tasks.

Try it

# Example: Using NVIDIA TensorRT with Unreal Engine NNE
# This is a conceptual representation and not actual code
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30
engine = builder.build_engine(network, config)
Source
Use multi-token prediction for faster inference in Gemma 4Auto

Multi-token prediction can speed up inference by predicting multiple tokens at once

InferenceCPU only

Google's Gemma 4 uses multi-token prediction to accelerate inference. This approach predicts multiple tokens at once, which can significantly speed up inference times, especially for models that generate sequences of tokens.

Try it

# Example: Multi-token prediction in Gemma 4
# This is a conceptual representation and not actual code
multi_token_prediction = True
inference_speedup = perform_inference(model, multi_token_prediction)
Source
Speed up Unreal Engine NNE inference with NVIDIA TensorRTAuto

Use NVIDIA TensorRT to accelerate neural network inference in Unreal Engine

InferenceRTX 3090 24GB

NVIDIA TensorRT is used to speed up neural network inference in Unreal Engine, which can be beneficial for applications requiring real-time image processing and graphics rendering.

Try it

# Example command to use TensorRT for inference
# This is a placeholder and actual command may vary based on specific use case
./inference_engine --model=model.trt --input=input_data
Source
Speed up Unreal Engine NNE inference with NVIDIA TensorRTAuto

NVIDIA TensorRT can accelerate neural network inference in Unreal Engine

InferenceRTX 30 series

NVIDIA TensorRT is used to speed up neural network inference in Unreal Engine, which can be beneficial for applications that require real-time graphics processing. The blog post suggests that using TensorRT can lead to significant performance improvements for tasks such as image quality enhancement and performance optimization.

Try it

# Example command to integrate TensorRT with Unreal Engine NNE
# Please refer to NVIDIA's documentation for specific implementation details
# https://developer.nvidia.com/blog/speed-up-unreal-engine-nne-inference-with-nvidia-tensorrt-for-rtx-runtime/
Source