AI Tips

Practical ways to train, run, or shrink AI models — explained for people new to AI. 19 new in last 30d.

New here? Each card answers one question: what is this and why should I care? Click a card to read the full explanation, including any new words. The command at the bottom is what you would type to try it on your own machine.

The AI flow — where each tip fits

Read left to right

An AI model goes through these five phases. Click a phase to see the tips that apply there.

1Pre-training
A model first learns language by reading huge amounts of text. This costs millions of dollars and runs on thousands of GPUs.
2Fine-tuning
You take that pre-trained model and teach it your own data, your own task, or your own writing style. Hours to days, on a few GPUs.
e.g. QLoRA · Unsloth · DeepSpeed ZeRO-3 / FSDP
See Training tips →
3Preference tuning
After fine-tuning, you teach the model which answers humans prefer. This makes it polite, helpful, and on-topic.
e.g. DPO / GRPO / KTO
See Training tips →
4Quantization
The trained model is huge. Quantization shrinks it about 4× by storing its numbers with less precision, so it fits on cheap hardware.
e.g. GGUF + llama.cpp · AWQ / GPTQ · EXL2
See Quantization tips →
5Inference / serving
Running the model so users can ask it questions. This is what your app actually does in production.
e.g. vLLM (PagedAttention) · Ollama · Speculative decoding
See Inference tips →

Live

Liger Kernel — about 20% less training VRAM, one import line

Replaces a few slow parts of the model with faster code. You add one line and your training uses 20% less GPU memory.

Toolingany training rig

When a model trains, certain math operations create big temporary tensors (RMSNorm, SwiGLU, RoPE, the loss function). Liger Kernel — open-sourced by LinkedIn — rewrites these in Triton so the temporaries never get created. The result: about 20% less VRAM on Llama-3 training, up to 30% less on long-context runs, no change in accuracy. Works alongside FSDP and LoRA. One import line and you are done.

Try it

pip install liger-kernel && python -c "from liger_kernel.transformers import apply_liger_kernel_to_llama; apply_liger_kernel_to_llama()"

Ollama — one command to run any model on your laptop

Like 'docker run' but for AI models. Auto-downloads, picks a good size for your machine, and exposes a local API.

ToolingMac / any GPU

Ollama wraps llama.cpp in a clean command-line tool plus a local HTTP API on port 11434 that speaks the OpenAI format. You type one command and it downloads the model, picks a quantized size that fits your machine, and starts serving. Perfect for testing prototypes, building agents, and anything on a Mac with Apple Silicon. The Modelfile lets you bake in a system prompt so the model behaves the way you want every time.

Try it

curl -fsSL https://ollama.com/install.sh | sh && ollama run llama3.1:70b-instruct-q4_K_M

FlashAttention 3 — about 2× faster attention on H100 / Blackwell

A faster version of the most expensive math step inside an AI model. Free speed-up on the newest NVIDIA cards.

ToolingH100 / B100 / B200 / 5090

Attention is the math step that lets the model look at every word it has read so far. Done naively it reads the GPU's slow memory many times per word. FlashAttention rewrites the same math so it streams data through the GPU's tiny fast memory once. Version 3 adds support for FP8 (8-bit math) on H100 and Blackwell GPUs, which doubles throughput compared to FP16. PyTorch 2.5 picks it up automatically when you call scaled_dot_product_attention on supported hardware. The biggest free perf win you can flip on a transformer.

Try it

pip install flash-attn --no-build-isolation

Vulkan Video support in Firefox for GPU-accelerated video decodingNewAuto

Mozilla Firefox now supports Vulkan Video for GPU-accelerated video decoding

ToolingVulkan compatible GPU

This feature allows Firefox to leverage the Vulkan API for video decoding, potentially improving performance and efficiency on systems with Vulkan-compatible GPUs.

Try it

firefox --enable-features=VulkanVideoDecode

NVIDIA Releases CUDA-Oxide for Rust-To-CUDA CompilerNewAuto

Experimental Rust-to-CUDA compiler for safer GPU kernel development

ToolingNVIDIA GPUs

CUDA-Oxide allows developers to write CUDA GPU kernels in Rust, providing a safer alternative to traditional CUDA C/C++ development. This can lead to more robust and maintainable GPU code.

Try it

rustc my_cuda_kernel.rs --crate-name cuda_oxide

Optimize Kernel Tracing for Function ParametersNewAuto

Improve kernel tracing by considering unused function parameters.

ToolingCPU only

Optimizing compilers may remove unused function parameters, which can affect kernel tracing or BPF subsystems. To address this, consider the implications of such optimizations on tracing functionality and adjust accordingly.

Try it

# Example of optimizing kernel tracing
# This is a placeholder command as the actual implementation depends on the specific kernel function and tracing setup.
# Please refer to the source article for detailed instructions.
echo 'Optimize kernel tracing for unused parameters'

Advance AI infrastructure with NVIDIA DOCA In-Silicon SecurityNewAuto

Utilize NVIDIA DOCA for enhancing AI infrastructure with in-silicon security.

ToolingNVIDIA DOCA

NVIDIA DOCA introduces a new class of infrastructure for AI, transforming data into intelligence for autonomous AI agents with unprecedented security and efficiency, making it ideal for building AI factories.

Try it

# Example command for using NVIDIA DOCA
# This is a placeholder command and may vary based on actual usage
doca_command --option <option_value>

Automate AI model documentation with NVIDIA MCG ToolkitNewAuto

Use NVIDIA MCG Toolkit to streamline AI model documentation.

ToolingAny NVIDIA GPU

The NVIDIA MCG Toolkit automates AI model documentation, which is crucial for regulatory compliance and understanding model complexity.

Try it

nvidia-mcg-toolkit --help

Automate AI model documentation with NVIDIA MCG ToolkitNewAuto

Use NVIDIA MCG Toolkit to automate AI model documentation for regulatory compliance.

ToolingN/A

As AI models grow in complexity, regulatory scrutiny intensifies. The NVIDIA MCG Toolkit automates AI model documentation, helping teams comply with frameworks like California's AB-2013 and the EU AI Act.

Try it

N/A

Run Step 3.7 Flash on NVIDIA GPUs for multimodal AINewAuto

Deploy multimodal AI applications on NVIDIA GPUs using Step 3.7 Flash.

ToolingAny NVIDIA GPU

AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and text. Step 3.7 Flash on NVIDIA GPUs enables this.

Try it

nvidia-smi

Automate AI model documentation with NVIDIA MCG ToolkitNewAuto

Use NVIDIA MCG Toolkit to automate AI model documentation for regulatory compliance.

ToolingCPU only

The NVIDIA MCG Toolkit helps automate AI model documentation, which is crucial for regulatory scrutiny under frameworks like California’s AB-2013 and the EU AI Act.

Try it

nvidia-mcg --help

Leverage CUDA Python 1.0 for unified GPU programmingNewAuto

CUDA 13.3 introduces CUDA Python 1.0 for easier GPU programming

ToolingAny NVIDIA GPU

CUDA 13.3 includes the release of CUDA Python 1.0, which provides a unified GPU programming model. This allows developers to write Python code that can run on both CPUs and GPUs, simplifying the development of GPU-accelerated applications.

Try it

import cupy as cp

CUDA Python 1.0 released for unified GPU programmingNewAuto

Simplify GPU programming with CUDA Python 1.0

ToolingNVIDIA GPU

CUDA 13.3 introduces CUDA Python 1.0, which provides a familiar Python-based interface for GPU programming, making it easier to develop and deploy GPU-accelerated applications across various domains.

Try it

import cupy as cp

CUDA 13.3 Enhances GPU Development with Tile Programming and AutotuningNewAuto

CUDA 13.3 introduces tile programming in C++, compiler autotuning, and Python updates.

ToolingAny NVIDIA GPU

NVIDIA CUDA 13.3 brings new capabilities and performance optimizations, including tile programming and compiler autotuning, benefiting developers across the CUDA ecosystem.

Try it

nvcc -tile my_kernel.cu

Develop High-Performance GPU Kernels with CUDA TileNewAuto

NVIDIA CUDA Tile allows optimizing GPU kernels using tile-based programming in C++.

ToolingAny NVIDIA GPU

NVIDIA CUDA Tile programming enables developers to write highly optimized GPU kernels within existing C++ codebases, improving performance.

Try it

#include <cuda_tile.h>
__global__ void my_kernel() {
  // Use tile-based programming constructs
}

Use NVIDIA CompileIQ for auto-tuning compiler optionsNewAuto

NVIDIA CompileIQ auto-tunes compiler options for optimal GPU performance.

ToolingAny NVIDIA GPU

NVIDIA CompileIQ tackles the problem of finding the best compiler options for a specific GPU, unlocking performance potential automatically.

Try it

nvcc -autotune option

Develop GPU kernels with NVIDIA CUDA TileNewAuto

NVIDIA CUDA Tile allows developing optimized GPU kernels using tile-based programming in C++.

ToolingAny NVIDIA GPU

NVIDIA CUDA Tile programming enables the development of highly optimized GPU kernels within existing C++ codebases, improving performance.

Try it

nvcc -tile my_kernel.cu

Using LLMs for reviewing kernel patchesNewAuto

Large language models are being used to review kernel patches.

ToolingCPU only

The kernel community has been exploring the use of large language models (LLMs) for reviewing kernel patches, which could potentially improve the efficiency and accuracy of the review process.

Try it

# Example command to use LLM for patch review (hypothetical)

AI assistance in Linux kernel developmentNewAuto

AI tools like GitHub Copilot and Claude Code are being used to generate or co-author patches for the Linux kernel.

ToolingCPU only

This indicates a growing trend in leveraging AI for software development, specifically in the context of kernel development, which can help improve efficiency and reduce the maintenance burden.

Try it

git apply < generated-patch-file.patch

AI Assistance in Generating Linux Kernel PatchesNewAuto

GitHub Copilot and Claude Code are being used to generate or co-author Linux kernel patches.

ToolingCPU only

This indicates an increasing trend in the use of AI for software development, particularly in the context of kernel development, which can help automate and speed up the process of fixing issues.

Try it

git apply < generated_patch_file.patch

Get real-time visibility into GPU usage across Kubernetes clustersNewAuto

Maximize AI infrastructure value with deep visibility into GPU utilization

ToolingAny GPU

NVIDIA's solution provides real-time visibility into GPU usage across Kubernetes clusters, which is essential for optimizing resource allocation and maximizing the value of AI infrastructure.

Try it

# Placeholder command, actual implementation depends on NVIDIA's monitoring tools

Use AI fuzzing tools for uncovering Linux kernel bugsNewAuto

AI fuzzing tools can help identify bugs in the Linux kernel

ToolingCPU only

Linux's second-in-command Greg Kroah-Hartman has been leveraging new AI fuzzing tools for uncovering Linux kernel bugs, highlighting the effectiveness of AI in identifying critical issues in system software.

Try it

# Example command to run AI fuzzing tool (hypothetical)

Linux Kernel Adds Documentation for Responsible AI UseAuto

New documentation in Linux 7.1 kernel focuses on responsible AI use for finding kernel bugs.

ToolingCPU only

This documentation provides guidelines on what qualifies as a security bug and how AI can be used responsibly to identify kernel bugs, which can help developers improve the security and reliability of their systems.

Try it

cat /usr/src/linux-7.1/Documentation/admin-guide/ai-bugs.rst

Document what qualifies as a security bug in AIAuto

Linux 7.1 kernel adds documentation for responsible AI use in finding kernel bugs.

ToolingCPU only

This documentation helps developers understand the criteria for what constitutes a security bug when using AI to find kernel bugs, promoting responsible AI practices.

Try it

cat /usr/src/linux-7.1/Documentation/security/responsible-ai-use.rst

Optimize GPU fleet with NVIDIA Fleet IntelligenceAuto

NVIDIA Fleet Intelligence provides real-time visibility and optimization for GPU fleets

ToolingNVIDIA GPUs

NVIDIA Fleet Intelligence can help maximize the utilization and efficiency of large GPU fleets. It provides insights into GPU usage and performance, enabling better resource allocation and optimization.

Try it

nvidia-fleet-intelligence --help

Use NVIDIA Fleet Intelligence for GPU fleet optimizationAuto

NVIDIA Fleet Intelligence provides real-time visibility and optimization for GPU fleets

ToolingNVIDIA GPUs

NVIDIA Fleet Intelligence offers tools for monitoring and optimizing large GPU fleets, enabling efficient resource utilization and performance.

Try it

nvidia-fleet-intelligence --query --update

Improve Linux per-core I/O performance with new patchesAuto

Jens Axboe is working on Linux patches to significantly improve per-core I/O performance.

ToolingCPU only

Following a presentation at the Linux storage, file-system, memory management and BPF summit, Jens Axboe was motivated to improve Linux I/O overhead compared to the Storage Performance Development Kit (SPDK), aiming for a 60% increase in per-core I/O performance.

Try it

# Apply the patches to the Linux kernel
git apply /path/to/axboe-patches

Utilize NVIDIA Dynamo for structured agentic exchangesAuto

NVIDIA Dynamo supports multi-turn agentic harness for preserving structured interactions.

ToolingAny NVIDIA GPU

NVIDIA Dynamo introduces multi-turn agentic harness support, allowing assistant turns to interleave reasoning with tool calls, and user turns to return responses. This is beneficial for applications requiring structured interactions and can be implemented on any NVIDIA GPU.

Try it

nvidia_dynamo_agentic_exchange

NVIDIA Releases CUDA-Oxide for Rust-To-CUDA CompilationAuto

Enables developers to use Rust for developing CUDA kernels for NVIDIA GPUs.

ToolingAny NVIDIA GPU

NVIDIA Labs project CUDA-Oxide 0.1 allows Rust to be used for developing CUDA kernels, potentially improving developer productivity and kernel performance.

Try it

cargo install cuda-oxide

Monitor real-time performance and debug with NCCL Inspector and PrometheusAuto

Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). NCCL Inspector and Prometheus can be used for real-time performance monitoring and faster debugging.

ToolingAny NVIDIA GPU

NCCL Inspector is a tool that can be used to monitor the performance of NCCL in real-time. It can help identify bottlenecks and issues in the communication between GPUs, leading to faster debugging and optimization of distributed deep learning workloads.

Try it

ncclInspector --log /path/to/logfile

Monitor GPU-to-GPU communication with NCCL Inspector and PrometheusAuto

Use NCCL Inspector and Prometheus for real-time performance monitoring and faster debugging in distributed deep learning

ToolingNVIDIA GPUs

Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). NCCL Inspector and Prometheus provide real-time performance monitoring and faster debugging.

Try it

ncclInspector --log <log_file>

Utilize ROCm 7.2.3 for AMD GPU compute and AI stack enhancementsAuto

ROCm 7.2.3 offers minor improvements for AMD GPU compute and AI stack.

ToolingAMD GPUs

ROCm 7.2.3 provides minor updates to the open-source AMD GPU compute and AI stack, which can be beneficial for developers working on AI applications that leverage AMD GPUs.

Try it

module load rocm/7.2.3

Utilize ROCm 7.2.3 for AMD GPU compute and AI stack enhancementsAuto

This release includes minor improvements to the open-source AMD GPU compute and AI stack.

ToolingAMD GPUs

ROCm 7.2.3 offers minor updates to the open-source AMD GPU compute and AI stack, which can be beneficial for developers working with AMD GPUs to enhance their AI applications.

Try it

# Example command to install ROCm 7.2.3
sudo apt install rocm-dkms rocm-utils

GCC 16 compiler delivers performance gains over GCC 15Auto

GCC 16.1 compiler shows performance improvements over GCC 15

ToolingCPU only

The GCC 16.1 compiler has been released with new changes and performance gains over GCC 15, which can benefit AI development by improving compilation times and efficiency.

Try it

gcc --version

AI tooling influences an increase in kernel patchesAuto

The 7.1 kernel prepatch suggests AI tooling is contributing to more patches than usual.

ToolingN/A

The article mentions that the 7.1 kernel prepatch has more patches than usual, likely due to AI tooling. This implies that AI tooling is becoming more prevalent in kernel development, which could lead to more efficient and effective kernel patches.

Try it

N/A

Leverage AMD GAIA for building local AI agentsAuto

AMD GAIA simplifies building AI agents on your PC with local AI processing.

ToolingCPU only

AMD GAIA (Generative AI Is Awesome) is an open-source software that leverages the Lemonade SDK to facilitate the creation of AI agents on Windows and Linux PCs. This can be beneficial for developers looking to build AI applications that run locally without relying on cloud-based services.

Try it

git clone https://github.com/AMD-GAIA/gaia && cd gaia && ./build.sh

Use AMD's GAIA for local AI development on Windows and LinuxAuto

AMD's GAIA simplifies building AI agents on PCs with local AI processing.

ToolingCPU only

AMD's GAIA (Generative AI Is Awesome) is an open-source software that leverages the Lemonade SDK to facilitate the creation of AI agents on PCs with local AI processing capabilities. This can be beneficial for developers looking to build AI applications without relying on cloud-based services, enhancing privacy and reducing latency.

Try it

git clone https://github.com/AMD-GAIA/GAIA.git && cd GAIA && ./build.sh

Automate GPU kernel translation with AI agentsAuto

Use AI to translate GPU kernels from Python to other languages, such as Julia, to expand the reach of your GPU-accelerated code.

ToolingAny GPU

NVIDIA's blog post discusses the use of AI agents to automate the translation of GPU kernels from Python to Julia, showcasing the potential for AI to assist in code portability and expansion across different programming environments.

Try it

# Example command to translate cuTile Python to cuTile.jl
# This is a conceptual representation and may not be directly executable
aitool translate --input cuTile.py --output cuTile.jl

Utilize NVIDIA CUDA Tile for tile-based GPU programmingAuto

NVIDIA CUDA Tile (cuTile) enables writing GPU kernels in terms of tile-level operations for better performance.

ToolingAny NVIDIA GPU

cuTile is a tile-based programming model that helps developers write GPU kernels focusing on tile-level operations such as loads, stores, and computations. This can lead to better performance and efficiency when programming for GPUs.

Try it

# Example of using cuTile for GPU kernel development
# This is a conceptual representation and not actual code
from cutile import Tile

# Define tile dimensions
tile = Tile(16, 16)

# Perform tile-based operations
# ...