AI Tips
Practical ways to train, run, or shrink AI models — explained for people new to AI. 19 new in last 30d.
New here? Each card answers one question: what is this and why should I care? Click a card to read the full explanation, including any new words. The command at the bottom is what you would type to try it on your own machine.
The AI flow — where each tip fits
Read left to rightAn AI model goes through these five phases. Click a phase to see the tips that apply there.
- 1Pre-training
A model first learns language by reading huge amounts of text. This costs millions of dollars and runs on thousands of GPUs.
- 2Fine-tuning
You take that pre-trained model and teach it your own data, your own task, or your own writing style. Hours to days, on a few GPUs.
e.g. QLoRA · Unsloth · DeepSpeed ZeRO-3 / FSDP
See Training tips → - 3Preference tuning
After fine-tuning, you teach the model which answers humans prefer. This makes it polite, helpful, and on-topic.
e.g. DPO / GRPO / KTO
See Training tips → - 4Quantization
The trained model is huge. Quantization shrinks it about 4× by storing its numbers with less precision, so it fits on cheap hardware.
e.g. GGUF + llama.cpp · AWQ / GPTQ · EXL2
See Quantization tips → - 5Inference / serving
Running the model so users can ask it questions. This is what your app actually does in production.
e.g. vLLM (PagedAttention) · Ollama · Speculative decoding
See Inference tips →
Liger Kernel — about 20% less training VRAM, one import lineReplaces a few slow parts of the model with faster code. You add one line and your training uses 20% less GPU memory.
Toolingany training rig
When a model trains, certain math operations create big temporary tensors (RMSNorm, SwiGLU, RoPE, the loss function). Liger Kernel — open-sourced by LinkedIn — rewrites these in Triton so the temporaries never get created. The result: about 20% less VRAM on Llama-3 training, up to 30% less on long-context runs, no change in accuracy. Works alongside FSDP and LoRA. One import line and you are done.
Try it
pip install liger-kernel && python -c "from liger_kernel.transformers import apply_liger_kernel_to_llama; apply_liger_kernel_to_llama()"Ollama — one command to run any model on your laptopLike 'docker run' but for AI models. Auto-downloads, picks a good size for your machine, and exposes a local API.
ToolingMac / any GPU
Ollama wraps llama.cpp in a clean command-line tool plus a local HTTP API on port 11434 that speaks the OpenAI format. You type one command and it downloads the model, picks a quantized size that fits your machine, and starts serving. Perfect for testing prototypes, building agents, and anything on a Mac with Apple Silicon. The Modelfile lets you bake in a system prompt so the model behaves the way you want every time.
Try it
curl -fsSL https://ollama.com/install.sh | sh && ollama run llama3.1:70b-instruct-q4_K_MFlashAttention 3 — about 2× faster attention on H100 / BlackwellA faster version of the most expensive math step inside an AI model. Free speed-up on the newest NVIDIA cards.
ToolingH100 / B100 / B200 / 5090
Attention is the math step that lets the model look at every word it has read so far. Done naively it reads the GPU's slow memory many times per word. FlashAttention rewrites the same math so it streams data through the GPU's tiny fast memory once. Version 3 adds support for FP8 (8-bit math) on H100 and Blackwell GPUs, which doubles throughput compared to FP16. PyTorch 2.5 picks it up automatically when you call scaled_dot_product_attention on supported hardware. The biggest free perf win you can flip on a transformer.
Try it
pip install flash-attn --no-build-isolationVulkan Video support in Firefox for GPU-accelerated video decodingNewAutoMozilla Firefox now supports Vulkan Video for GPU-accelerated video decoding
ToolingVulkan compatible GPU
This feature allows Firefox to leverage the Vulkan API for video decoding, potentially improving performance and efficiency on systems with Vulkan-compatible GPUs.
Try it
firefox --enable-features=VulkanVideoDecodeNVIDIA Releases CUDA-Oxide for Rust-To-CUDA CompilerNewAutoExperimental Rust-to-CUDA compiler for safer GPU kernel development
ToolingNVIDIA GPUs
CUDA-Oxide allows developers to write CUDA GPU kernels in Rust, providing a safer alternative to traditional CUDA C/C++ development. This can lead to more robust and maintainable GPU code.
Try it
rustc my_cuda_kernel.rs --crate-name cuda_oxideOptimize Kernel Tracing for Function ParametersNewAutoImprove kernel tracing by considering unused function parameters.
ToolingCPU only
Optimizing compilers may remove unused function parameters, which can affect kernel tracing or BPF subsystems. To address this, consider the implications of such optimizations on tracing functionality and adjust accordingly.
Try it
# Example of optimizing kernel tracing
# This is a placeholder command as the actual implementation depends on the specific kernel function and tracing setup.
# Please refer to the source article for detailed instructions.
echo 'Optimize kernel tracing for unused parameters'Advance AI infrastructure with NVIDIA DOCA In-Silicon SecurityNewAutoUtilize NVIDIA DOCA for enhancing AI infrastructure with in-silicon security.
ToolingNVIDIA DOCA
NVIDIA DOCA introduces a new class of infrastructure for AI, transforming data into intelligence for autonomous AI agents with unprecedented security and efficiency, making it ideal for building AI factories.
Try it
# Example command for using NVIDIA DOCA
# This is a placeholder command and may vary based on actual usage
doca_command --option <option_value>Automate AI model documentation with NVIDIA MCG ToolkitNewAutoUse NVIDIA MCG Toolkit to streamline AI model documentation.
ToolingAny NVIDIA GPU
The NVIDIA MCG Toolkit automates AI model documentation, which is crucial for regulatory compliance and understanding model complexity.
Try it
nvidia-mcg-toolkit --helpAutomate AI model documentation with NVIDIA MCG ToolkitNewAutoUse NVIDIA MCG Toolkit to automate AI model documentation for regulatory compliance.
ToolingN/A
As AI models grow in complexity, regulatory scrutiny intensifies. The NVIDIA MCG Toolkit automates AI model documentation, helping teams comply with frameworks like California's AB-2013 and the EU AI Act.
Try it
N/ARun Step 3.7 Flash on NVIDIA GPUs for multimodal AINewAutoDeploy multimodal AI applications on NVIDIA GPUs using Step 3.7 Flash.
ToolingAny NVIDIA GPU
AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and text. Step 3.7 Flash on NVIDIA GPUs enables this.
Try it
nvidia-smiAutomate AI model documentation with NVIDIA MCG ToolkitNewAutoUse NVIDIA MCG Toolkit to automate AI model documentation for regulatory compliance.
ToolingCPU only
The NVIDIA MCG Toolkit helps automate AI model documentation, which is crucial for regulatory scrutiny under frameworks like California’s AB-2013 and the EU AI Act.
Try it
nvidia-mcg --helpLeverage CUDA Python 1.0 for unified GPU programmingNewAutoCUDA 13.3 introduces CUDA Python 1.0 for easier GPU programming
ToolingAny NVIDIA GPU
CUDA 13.3 includes the release of CUDA Python 1.0, which provides a unified GPU programming model. This allows developers to write Python code that can run on both CPUs and GPUs, simplifying the development of GPU-accelerated applications.
Try it
import cupy as cpCUDA Python 1.0 released for unified GPU programmingNewAutoSimplify GPU programming with CUDA Python 1.0
ToolingNVIDIA GPU
CUDA 13.3 introduces CUDA Python 1.0, which provides a familiar Python-based interface for GPU programming, making it easier to develop and deploy GPU-accelerated applications across various domains.
Try it
import cupy as cpCUDA 13.3 Enhances GPU Development with Tile Programming and AutotuningNewAutoCUDA 13.3 introduces tile programming in C++, compiler autotuning, and Python updates.
ToolingAny NVIDIA GPU
NVIDIA CUDA 13.3 brings new capabilities and performance optimizations, including tile programming and compiler autotuning, benefiting developers across the CUDA ecosystem.
Try it
nvcc -tile my_kernel.cuDevelop High-Performance GPU Kernels with CUDA TileNewAutoNVIDIA CUDA Tile allows optimizing GPU kernels using tile-based programming in C++.
ToolingAny NVIDIA GPU
NVIDIA CUDA Tile programming enables developers to write highly optimized GPU kernels within existing C++ codebases, improving performance.
Try it
#include <cuda_tile.h>
__global__ void my_kernel() {
// Use tile-based programming constructs
}Use NVIDIA CompileIQ for auto-tuning compiler optionsNewAutoNVIDIA CompileIQ auto-tunes compiler options for optimal GPU performance.
ToolingAny NVIDIA GPU
NVIDIA CompileIQ tackles the problem of finding the best compiler options for a specific GPU, unlocking performance potential automatically.
Try it
nvcc -autotune optionDevelop GPU kernels with NVIDIA CUDA TileNewAutoNVIDIA CUDA Tile allows developing optimized GPU kernels using tile-based programming in C++.
ToolingAny NVIDIA GPU
NVIDIA CUDA Tile programming enables the development of highly optimized GPU kernels within existing C++ codebases, improving performance.
Try it
nvcc -tile my_kernel.cuUsing LLMs for reviewing kernel patchesNewAutoLarge language models are being used to review kernel patches.
ToolingCPU only
The kernel community has been exploring the use of large language models (LLMs) for reviewing kernel patches, which could potentially improve the efficiency and accuracy of the review process.
Try it
# Example command to use LLM for patch review (hypothetical)AI assistance in Linux kernel developmentNewAutoAI tools like GitHub Copilot and Claude Code are being used to generate or co-author patches for the Linux kernel.
ToolingCPU only
This indicates a growing trend in leveraging AI for software development, specifically in the context of kernel development, which can help improve efficiency and reduce the maintenance burden.
Try it
git apply < generated-patch-file.patchAI Assistance in Generating Linux Kernel PatchesNewAutoGitHub Copilot and Claude Code are being used to generate or co-author Linux kernel patches.
ToolingCPU only
This indicates an increasing trend in the use of AI for software development, particularly in the context of kernel development, which can help automate and speed up the process of fixing issues.
Try it
git apply < generated_patch_file.patchGet real-time visibility into GPU usage across Kubernetes clustersNewAutoMaximize AI infrastructure value with deep visibility into GPU utilization
ToolingAny GPU
NVIDIA's solution provides real-time visibility into GPU usage across Kubernetes clusters, which is essential for optimizing resource allocation and maximizing the value of AI infrastructure.
Try it
# Placeholder command, actual implementation depends on NVIDIA's monitoring toolsUse AI fuzzing tools for uncovering Linux kernel bugsNewAutoAI fuzzing tools can help identify bugs in the Linux kernel
ToolingCPU only
Linux's second-in-command Greg Kroah-Hartman has been leveraging new AI fuzzing tools for uncovering Linux kernel bugs, highlighting the effectiveness of AI in identifying critical issues in system software.
Try it
# Example command to run AI fuzzing tool (hypothetical)Linux Kernel Adds Documentation for Responsible AI UseAutoNew documentation in Linux 7.1 kernel focuses on responsible AI use for finding kernel bugs.
ToolingCPU only
This documentation provides guidelines on what qualifies as a security bug and how AI can be used responsibly to identify kernel bugs, which can help developers improve the security and reliability of their systems.
Try it
cat /usr/src/linux-7.1/Documentation/admin-guide/ai-bugs.rstDocument what qualifies as a security bug in AIAutoLinux 7.1 kernel adds documentation for responsible AI use in finding kernel bugs.
ToolingCPU only
This documentation helps developers understand the criteria for what constitutes a security bug when using AI to find kernel bugs, promoting responsible AI practices.
Try it
cat /usr/src/linux-7.1/Documentation/security/responsible-ai-use.rstOptimize GPU fleet with NVIDIA Fleet IntelligenceAutoNVIDIA Fleet Intelligence provides real-time visibility and optimization for GPU fleets
ToolingNVIDIA GPUs
NVIDIA Fleet Intelligence can help maximize the utilization and efficiency of large GPU fleets. It provides insights into GPU usage and performance, enabling better resource allocation and optimization.
Try it
nvidia-fleet-intelligence --helpUse NVIDIA Fleet Intelligence for GPU fleet optimizationAutoNVIDIA Fleet Intelligence provides real-time visibility and optimization for GPU fleets
ToolingNVIDIA GPUs
NVIDIA Fleet Intelligence offers tools for monitoring and optimizing large GPU fleets, enabling efficient resource utilization and performance.
Try it
nvidia-fleet-intelligence --query --updateImprove Linux per-core I/O performance with new patchesAutoJens Axboe is working on Linux patches to significantly improve per-core I/O performance.
ToolingCPU only
Following a presentation at the Linux storage, file-system, memory management and BPF summit, Jens Axboe was motivated to improve Linux I/O overhead compared to the Storage Performance Development Kit (SPDK), aiming for a 60% increase in per-core I/O performance.
Try it
# Apply the patches to the Linux kernel
git apply /path/to/axboe-patchesUtilize NVIDIA Dynamo for structured agentic exchangesAutoNVIDIA Dynamo supports multi-turn agentic harness for preserving structured interactions.
ToolingAny NVIDIA GPU
NVIDIA Dynamo introduces multi-turn agentic harness support, allowing assistant turns to interleave reasoning with tool calls, and user turns to return responses. This is beneficial for applications requiring structured interactions and can be implemented on any NVIDIA GPU.
Try it
nvidia_dynamo_agentic_exchangeNVIDIA Releases CUDA-Oxide for Rust-To-CUDA CompilationAutoEnables developers to use Rust for developing CUDA kernels for NVIDIA GPUs.
ToolingAny NVIDIA GPU
NVIDIA Labs project CUDA-Oxide 0.1 allows Rust to be used for developing CUDA kernels, potentially improving developer productivity and kernel performance.
Try it
cargo install cuda-oxideMonitor real-time performance and debug with NCCL Inspector and PrometheusAutoDistributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). NCCL Inspector and Prometheus can be used for real-time performance monitoring and faster debugging.
ToolingAny NVIDIA GPU
NCCL Inspector is a tool that can be used to monitor the performance of NCCL in real-time. It can help identify bottlenecks and issues in the communication between GPUs, leading to faster debugging and optimization of distributed deep learning workloads.
Try it
ncclInspector --log /path/to/logfileMonitor GPU-to-GPU communication with NCCL Inspector and PrometheusAutoUse NCCL Inspector and Prometheus for real-time performance monitoring and faster debugging in distributed deep learning
ToolingNVIDIA GPUs
Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). NCCL Inspector and Prometheus provide real-time performance monitoring and faster debugging.
Try it
ncclInspector --log <log_file>Utilize ROCm 7.2.3 for AMD GPU compute and AI stack enhancementsAutoROCm 7.2.3 offers minor improvements for AMD GPU compute and AI stack.
ToolingAMD GPUs
ROCm 7.2.3 provides minor updates to the open-source AMD GPU compute and AI stack, which can be beneficial for developers working on AI applications that leverage AMD GPUs.
Try it
module load rocm/7.2.3Utilize ROCm 7.2.3 for AMD GPU compute and AI stack enhancementsAutoThis release includes minor improvements to the open-source AMD GPU compute and AI stack.
ToolingAMD GPUs
ROCm 7.2.3 offers minor updates to the open-source AMD GPU compute and AI stack, which can be beneficial for developers working with AMD GPUs to enhance their AI applications.
Try it
# Example command to install ROCm 7.2.3
sudo apt install rocm-dkms rocm-utilsGCC 16 compiler delivers performance gains over GCC 15AutoGCC 16.1 compiler shows performance improvements over GCC 15
ToolingCPU only
The GCC 16.1 compiler has been released with new changes and performance gains over GCC 15, which can benefit AI development by improving compilation times and efficiency.
Try it
gcc --versionAI tooling influences an increase in kernel patchesAutoThe 7.1 kernel prepatch suggests AI tooling is contributing to more patches than usual.
ToolingN/A
The article mentions that the 7.1 kernel prepatch has more patches than usual, likely due to AI tooling. This implies that AI tooling is becoming more prevalent in kernel development, which could lead to more efficient and effective kernel patches.
Try it
N/ALeverage AMD GAIA for building local AI agentsAutoAMD GAIA simplifies building AI agents on your PC with local AI processing.
ToolingCPU only
AMD GAIA (Generative AI Is Awesome) is an open-source software that leverages the Lemonade SDK to facilitate the creation of AI agents on Windows and Linux PCs. This can be beneficial for developers looking to build AI applications that run locally without relying on cloud-based services.
Try it
git clone https://github.com/AMD-GAIA/gaia && cd gaia && ./build.shUse AMD's GAIA for local AI development on Windows and LinuxAutoAMD's GAIA simplifies building AI agents on PCs with local AI processing.
ToolingCPU only
AMD's GAIA (Generative AI Is Awesome) is an open-source software that leverages the Lemonade SDK to facilitate the creation of AI agents on PCs with local AI processing capabilities. This can be beneficial for developers looking to build AI applications without relying on cloud-based services, enhancing privacy and reducing latency.
Try it
git clone https://github.com/AMD-GAIA/GAIA.git && cd GAIA && ./build.shAutomate GPU kernel translation with AI agentsAutoUse AI to translate GPU kernels from Python to other languages, such as Julia, to expand the reach of your GPU-accelerated code.
ToolingAny GPU
NVIDIA's blog post discusses the use of AI agents to automate the translation of GPU kernels from Python to Julia, showcasing the potential for AI to assist in code portability and expansion across different programming environments.
Try it
# Example command to translate cuTile Python to cuTile.jl
# This is a conceptual representation and may not be directly executable
aitool translate --input cuTile.py --output cuTile.jlUtilize NVIDIA CUDA Tile for tile-based GPU programmingAutoNVIDIA CUDA Tile (cuTile) enables writing GPU kernels in terms of tile-level operations for better performance.
ToolingAny NVIDIA GPU
cuTile is a tile-based programming model that helps developers write GPU kernels focusing on tile-level operations such as loads, stores, and computations. This can lead to better performance and efficiency when programming for GPUs.
Try it
# Example of using cuTile for GPU kernel development
# This is a conceptual representation and not actual code
from cutile import Tile
# Define tile dimensions
tile = Tile(16, 16)
# Perform tile-based operations
# ...