AI Tips

Practical ways to train, run, or shrink AI models — explained for people new to AI. 52 new in last 30d.

New here? Each card answers one question: what is this and why should I care? Click a card to read the full explanation, including any new words. The command at the bottom is what you would type to try it on your own machine.

The AI flow — where each tip fits

Read left to right

An AI model goes through these five phases. Click a phase to see the tips that apply there.

1Pre-training
A model first learns language by reading huge amounts of text. This costs millions of dollars and runs on thousands of GPUs.
2Fine-tuning
You take that pre-trained model and teach it your own data, your own task, or your own writing style. Hours to days, on a few GPUs.
e.g. QLoRA · Unsloth · DeepSpeed ZeRO-3 / FSDP
See Training tips →
3Preference tuning
After fine-tuning, you teach the model which answers humans prefer. This makes it polite, helpful, and on-topic.
e.g. DPO / GRPO / KTO
See Training tips →
4Quantization
The trained model is huge. Quantization shrinks it about 4× by storing its numbers with less precision, so it fits on cheap hardware.
e.g. GGUF + llama.cpp · AWQ / GPTQ · EXL2
See Quantization tips →
5Inference / serving
Running the model so users can ask it questions. This is what your app actually does in production.
e.g. vLLM (PagedAttention) · Ollama · Speculative decoding
See Inference tips →

Live

Pick the right serving stack — vLLM vs TGI vs Aphrodite

Three popular ways to host a model behind an API. vLLM is the safe default. TGI fits Hugging Face shops. Aphrodite is for heavy sampling.

Inferenceproduction GPU node

A serving stack is the program that turns a model into a web API. vLLM has the best throughput and an OpenAI-compatible API — it should be the default for new deployments. TGI (Text Generation Inference) is Hugging Face's, integrates nicely with their Inference Endpoints, and supports speculative decoding out of the box. Aphrodite is a vLLM fork tuned for high-temperature sampling, multi-LoRA, and role-play workloads. If you are unsure: start with vLLM.

Try it

# vLLM (most teams)
vllm serve mistralai/Mistral-Nemo-Instruct-2407
# TGI (HF infra)
docker run --gpus all ghcr.io/huggingface/text-generation-inference --model-id mistralai/Mistral-Nemo-Instruct-2407

Liger Kernel — about 20% less training VRAM, one import line

Replaces a few slow parts of the model with faster code. You add one line and your training uses 20% less GPU memory.

Toolingany training rig

When a model trains, certain math operations create big temporary tensors (RMSNorm, SwiGLU, RoPE, the loss function). Liger Kernel — open-sourced by LinkedIn — rewrites these in Triton so the temporaries never get created. The result: about 20% less VRAM on Llama-3 training, up to 30% less on long-context runs, no change in accuracy. Works alongside FSDP and LoRA. One import line and you are done.

Try it

pip install liger-kernel && python -c "from liger_kernel.transformers import apply_liger_kernel_to_llama; apply_liger_kernel_to_llama()"

DPO / GRPO / KTO — teach a model what good looks like

Modern ways to use human feedback to make a model prefer good answers over bad ones. Simpler than the old RLHF setup.

Training1× A100 80GB or QLoRA on 24GB

Classic RLHF (the method behind ChatGPT) trains a separate reward model first, then runs reinforcement learning. It works but it is complicated. DPO (Direct Preference Optimization) skips the reward model — you just give it pairs of (good answer, bad answer) and it directly adjusts the model. GRPO scales this to math and reasoning where you can verify correctness automatically (DeepSeek used it for their math model). KTO needs only single labels (was this answer good? yes/no), so you can use cheap data like in-app thumbs-up/down. All three are in the trl library.

Try it

pip install trl && python -m trl.scripts.dpo --model meta-llama/Llama-3.1-8B --dataset HuggingFaceH4/ultrafeedback_binarized

MLX — run big models on a Mac

Apple's framework that uses the Mac's shared memory. A 64 GB MacBook Pro can run a Llama-3 70B at usable speed.

Cheap GPUM2 / M3 / M4 Max 64GB+

On a normal PC the GPU has its own memory (VRAM) and the CPU has its own memory (RAM), and they have to copy data between them — that copy is slow. Apple Silicon Macs share one memory pool between CPU and GPU, so there is no copy. MLX is Apple's framework that takes advantage of this. A 64 GB M3 Max runs a 4-bit Llama-3 70B at about 10 tokens per second, which is usable for chat. The mlx-community on Hugging Face mirrors popular models pre-quantized for you.

Try it

pip install mlx-lm && mlx_lm.generate --model mlx-community/Llama-3.1-70B-Instruct-4bit --prompt 'hello'

Unsloth — 2× faster LoRA fine-tuning, half the VRAM

A drop-in library that makes fine-tuning twice as fast and uses half the GPU memory. Same result, less waiting.

TrainingRTX 3090 / 4090 24GB

Hugging Face's PEFT library is fine, but it is written in pure PyTorch which leaves performance on the table. Unsloth rewrites the LoRA forward and backward passes in Triton (NVIDIA's fast-kernel language). Result: the same loss curves, about 2× faster, about 50% less VRAM. Drop-in with Llama, Mistral, Phi, Gemma, Qwen — change a couple of import lines and it works. Their notebooks are a good starting point if you have never fine-tuned.

Try it

pip install unsloth && python -m unsloth.examples.llama3_8b_finetune

DeepSpeed ZeRO-3 / PyTorch FSDP — train models too big for one GPU

Splits a giant model across many GPUs during training. The way teams fine-tune 70B+ models on 8 cards.

Training8× A100 80GB or H100s

When you train a model, you also need memory for gradients, optimizer state, and activations — together about 4× the model itself. ZeRO is a method that shards (splits) those across all your GPUs so each card only stores a slice. ZeRO-3 also shards the model parameters themselves. Combined with mixed-precision (bf16) training and activation checkpointing, this lets a normal 8× A100 box train a 70B model. PyTorch FSDP is the in-tree alternative with the same idea.

Try it

deepspeed --num_gpus 8 train.py --deepspeed ds_config_zero3.json

Ollama — one command to run any model on your laptop

Like 'docker run' but for AI models. Auto-downloads, picks a good size for your machine, and exposes a local API.

ToolingMac / any GPU

Ollama wraps llama.cpp in a clean command-line tool plus a local HTTP API on port 11434 that speaks the OpenAI format. You type one command and it downloads the model, picks a quantized size that fits your machine, and starts serving. Perfect for testing prototypes, building agents, and anything on a Mac with Apple Silicon. The Modelfile lets you bake in a system prompt so the model behaves the way you want every time.

Try it

curl -fsSL https://ollama.com/install.sh | sh && ollama run llama3.1:70b-instruct-q4_K_M

Mixture of Experts (MoE) — huge model, fast model latency

The model is a team of small experts. For each word it only uses two of them. So a 141B model answers as fast as a 39B one.

Techniquedepends on top-k

A normal ('dense') model uses all of its parameters for every word. An MoE replaces one block with several smaller experts, plus a router that picks the top-k experts (usually 2) per word. So Mixtral-8x22B has 141B total parameters but only ~39B are active per word — answers come at the speed of a 39B model. The trade-off: you still have to load all experts in GPU memory, so VRAM is high. DeepSeek-V3 and Mixtral are the well-known open MoEs.

Try it

vllm serve mistralai/Mixtral-8x22B-Instruct-v0.1 --tensor-parallel-size 4

FlashAttention 3 — about 2× faster attention on H100 / Blackwell

A faster version of the most expensive math step inside an AI model. Free speed-up on the newest NVIDIA cards.

ToolingH100 / B100 / B200 / 5090

Attention is the math step that lets the model look at every word it has read so far. Done naively it reads the GPU's slow memory many times per word. FlashAttention rewrites the same math so it streams data through the GPU's tiny fast memory once. Version 3 adds support for FP8 (8-bit math) on H100 and Blackwell GPUs, which doubles throughput compared to FP16. PyTorch 2.5 picks it up automatically when you call scaled_dot_product_attention on supported hardware. The biggest free perf win you can flip on a transformer.

Try it

pip install flash-attn --no-build-isolation

Speculative decoding — make the big model faster, for free

A tiny model guesses the next words. The big model just checks the guesses in one batch. 2–3× faster, same answer quality.

Techniqueany inference target

Big models are slow because they generate one word (or token) at a time. With speculative decoding you also load a small, cheap 'draft' model. The draft model writes the next 5 tokens. The big model then runs once and checks all 5 in parallel. Tokens it agrees with are kept; tokens it does not agree with are replaced. The final answer is identical to what the big model would have written alone — you just spent fewer expensive runs to get there. Built into vLLM, TGI, and llama.cpp.

Try it

vllm serve meta-llama/Llama-3.1-70B --speculative-model meta-llama/Llama-3.2-1B --num-speculative-tokens 5

vLLM — serve a model 5–10× faster than the basic library

A serving engine that handles many requests at the same time without wasting GPU memory. The default choice for production.

Inference1× A100 / H100 / 4090

When a model answers, it has to remember what it has already written — this is called the KV cache and it eats a lot of GPU memory. The basic Hugging Face library reserves a fixed chunk for each request, which leaves a lot of memory unused. vLLM borrows the idea of virtual memory from operating systems: it splits the cache into small pages and reuses them, so the same GPU can serve 5–10× more requests at the same time. It also exposes an OpenAI-compatible HTTP API, so any client that talks to OpenAI can talk to vLLM with one URL change.

Try it

pip install vllm && vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 1

EXL2 — the fastest way to run shrunk models on a gaming GPU

Picks the best precision for each part of the model. About 2× faster than the GGUF format on a single RTX card.

InferenceRTX 3090 / 4090 / 5090

EXL2 is a quantization format made for ExLlamaV2, a runtime built only for NVIDIA GPUs (no CPU offload). It is smarter than the others because it spends more bits on the important parts of the model and fewer bits on the rest, hitting an average target like 4.0 bits per weight. On one RTX 4090 a Llama-3 70B in EXL2 4.0bpw runs at about 20 tokens per second — roughly twice the speed of an equivalent GGUF setup. Use this when you need pure speed on a single card.

Try it

pip install exllamav2 && python examples/inference.py -m models/llama-3-70b-exl2-4.0bpw -p 'hello'

AWQ / GPTQ — shrink a model 4× without losing quality

Smarter compression that picks which numbers matter most and keeps those accurate. The model takes a quarter of the memory but answers almost the same.

Quantizationany 16GB+ GPU

Quantization (storing numbers in 4-bit instead of 16-bit) usually loses some quality. AWQ — short for Activation-aware Weight Quantization — figures out which channels in the model are most important and protects them. The result: 4-bit AWQ keeps about 99% of the quality of the full model, but uses a quarter of the memory. GPTQ is the older method and slightly weaker. AWQ also runs faster on modern NVIDIA cards because vLLM and TGI ship optimized code for it.

Try it

python -c "from awq import AutoAWQForCausalLM; m = AutoAWQForCausalLM.from_quantized('casperhansen/llama-3-70b-awq')"

GGUF + llama.cpp — run a 70B model on a gaming PC

A file format that packs a model so it can split between your GPU and your normal computer memory. Lets a 70B model run on a 4090.

QuantizationRTX 4090 24GB (with offload)

GGUF is the file format llama.cpp uses to ship pre-shrunk (quantized) models. Quantization means storing the model's numbers with less precision so the file is smaller — Q4_K_M is the popular setting, about 4× smaller than the original. The clever part is layer offload: you tell llama.cpp how many layers to keep on the GPU (-ngl 35) and the rest stay in your normal computer RAM. So a 70B model that needs ~40 GB can split: 24 GB on a 4090 + 16 GB in system RAM. About 8 tokens per second on a 4090 with 64 GB DDR5.

Try it

llama-cli -m llama-3.1-70b-q4_k_m.gguf -ngl 35 -c 8192 -p "hello"

QLoRA — fine-tune a 70B model on one consumer GPU

Teach a giant model new skills using only 24 GB of GPU memory, instead of the 320 GB you would normally need.

TrainingRTX 3090 / 4090 24GB

Fine-tuning means starting from an already-trained model and teaching it your own data. A 70B-parameter model normally needs about 320 GB of GPU memory (VRAM) to fine-tune — that costs thousands per hour in the cloud. QLoRA does two clever things: it stores the original model in 4-bit numbers (about 4× smaller), and it only trains a tiny add-on called a LoRA adapter, not the whole model. The full setup fits on a single 24 GB gaming card with no real loss in quality. Result: home-lab fine-tuning is suddenly possible.

Try it

pip install bitsandbytes peft transformers && python -m peft.examples.qlora --model meta-llama/Llama-3.1-70B --bits 4

Boost Mixture-of-Experts training throughput with advanced fusion kernelsNewAuto

Increase training throughput for Mixture-of-Experts models using advanced fusion kernels

TrainingAny GPU

Advanced fusion kernels can significantly boost the training throughput of Mixture-of-Experts models, which are a key component in large-scale AI systems, by optimizing the communication between experts.

Try it

moe_model = MixtureOfExpertsModel()
advanced_fusion_kernels.optimize_training_throughput(moe_model)

Fine-tune biological foundation models with LoRA using NVIDIA BioNeMoNewAuto

Use NVIDIA BioNeMo recipes to fine-tune biological foundation models with LoRA

TrainingAny GPU

NVIDIA BioNeMo provides recipes for fine-tuning biological foundation models with LoRA, allowing for efficient and effective updates to these large models in the field of computational biology.

Try it

nvidia_bionemo_recipes = NVIDIABioNeMoRecipes()
lora_finetuned_model = nvidia_bionemo_recipes.fine_tune_biological_model(model, data)

Boost Mixture-of-Experts training throughput with advanced fusion kernelsNewAuto

Increase MoE model training throughput using advanced fusion kernels

TrainingRTX 3090 24GB

NVIDIA's blog post discusses how to boost the training throughput of Mixture-of-Experts (MoE) models by using advanced fusion kernels, which can significantly improve the efficiency of training large-scale AI systems.

Try it

python -m moe_train --advanced-fusion-kernels

Fine-tune biological foundation models with LoRA using NVIDIA BioNeMoNewAuto

Use NVIDIA BioNeMo recipes to fine-tune foundation models with LoRA for computational biology tasks

TrainingRTX 3090 24GB

NVIDIA BioNeMo provides recipes for fine-tuning large foundation models like ESM2 using LoRA, allowing for efficient and effective updates to these models for specific computational biology tasks.

Try it

python -m biodemo.lora_finetune --config-file config.yaml

Simplify AI infrastructure at the edge with Cisco and CanonicalNewAuto

Test-time inference is shifting to the edge to reduce latency and bandwidth consumption.

Cheap GPUCPU only

Legacy infrastructure was not designed for AI era requirements. Large-scale model training remains centralized in data centers, while test-time inference is moving to the edge to reduce latency and bandwidth consumption.

Try it

# Example command to deploy AI model on edge device
sudo docker run -d --name ai-edge-model my-ai-model:latest

Shift test-time inference to edge to reduce latency and bandwidthNewAuto

Move inference from centralized data centers to edge devices to improve performance

InferenceCPU only

Legacy infrastructure was not designed for AI requirements. Test-time inference is rapidly shifting to the edge to reduce latency and bandwidth consumption. This shift can significantly improve performance for applications requiring real-time responses.

Try it

sudo apt-get install edge-device-software

Simplify AI infrastructure at the edge with Cisco and CanonicalNewAuto

Test-time inference is shifting to the edge to reduce latency and bandwidth consumption

InferenceCPU only

Legacy infrastructure was not designed for AI era requirements. Large-scale model training remains centralized in data centers, but test-time inference is moving to the edge to reduce latency and bandwidth consumption.

Try it

# Example command to deploy AI model at the edge
sudo apt-get install -y edge-ai-model
edge-ai-model --deploy

Optimize AI inference at the edge to reduce latency and bandwidthNewAuto

Shifting test-time inference to the edge can enhance performance

InferenceCPU only

As large-scale model training remains centralized in data centers, test-time inference is rapidly shifting to the edge to reduce latency and bandwidth consumption, which is crucial for real-time AI applications.

Try it

# Example command to deploy model inference at the edge
# This is a placeholder and actual command depends on the specific edge device and software stack
edge_deploy_model.sh

Optimize AI infrastructure for edge deploymentNewAuto

Shift test-time inference to the edge to reduce latency and bandwidth consumption

Cheap GPUCPU only

Legacy infrastructure was not designed for AI requirements. Large-scale model training remains centralized in data centers, but test-time inference is rapidly shifting to the edge. This can help reduce latency and bandwidth consumption, which is crucial for real-time AI applications.

Try it

sudo apt-get install -y edge-ai-optimization-tool

Convert FP8 Checkpoints to NVIDIA TensorRT for High-Performance InferenceNewAuto

Bridge the gap between model optimization and production deployment

QuantizationNVIDIA GPU

This process enables faster inference by converting a quantized checkpoint into an NVIDIA TensorRT engine, which is crucial for deploying optimized models in production.

Try it

tensorrt_script = 'python post_training_quantization/convert_to_tensorrt.py --fp8_checkpoint /path/to/fp8/checkpoint --output /path/to/output/engine'

Convert FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRTNewAuto

Bridging the gap between model optimization and production deployment

QuantizationNVIDIA GPU

Converting a quantized checkpoint into an NVIDIA TensorRT engine enables faster inference, improving model performance and reducing latency.

Try it

tensorrt_script = 'python -m torch.ao.quantization.fx.prepare_qat_fx ' + '-m model_fp32 ' + '-m model_qat'

Train models faster with JAX and MaxText using NVFP4 on NVIDIA BlackwellNewAuto

Improve throughput when training large language models with JAX and MaxText on NVIDIA Blackwell

TrainingNVIDIA Blackwell

When training spans trillions of tokens across thousands of accelerators, every percentage point of step improvement matters. Using JAX and MaxText with NVFP4 on NVIDIA Blackwell can help achieve better throughput, which is crucial for pre-training frontier LLMs.

Try it

# Example command using JAX and MaxText
# This is a placeholder command and should be replaced with the actual usage
jax.run(your_model_training_function)

Train models faster with JAX and MaxText using NVFP4 on NVIDIA BlackwellNewAuto

Improve throughput when training large language models with JAX and MaxText on NVIDIA Blackwell

TrainingNVIDIA Blackwell

When training spans trillions of tokens across thousands of accelerators, every percentage point of step improvement matters. Using JAX with MaxText and NVFP4 on NVIDIA Blackwell can significantly improve throughput, leading to faster training times for large language models.

Try it

jax.run(your_model, your_data, max_text=True, nvfp4=True)

GPU-accelerated video decoding with Vulkan in FirefoxNewAuto

Firefox now supports Vulkan Video for improved video decoding performance

InferenceVulkan compatible GPU

Mozilla Firefox has merged initial support for Vulkan Video, which allows for GPU-accelerated video decoding. This can lead to performance improvements and reduced CPU usage when playing videos in the browser, especially on systems with Vulkan-compatible GPUs.

Try it

# Enable Vulkan Video decoding in Firefox
# This may require enabling specific settings or flags in the browser

Vulkan Video support in Firefox for GPU-accelerated video decodingNewAuto

Mozilla Firefox now supports Vulkan Video for GPU-accelerated video decoding

ToolingVulkan compatible GPU

This feature allows Firefox to leverage the Vulkan API for video decoding, potentially improving performance and efficiency on systems with Vulkan-compatible GPUs.

Try it

firefox --enable-features=VulkanVideoDecode

Optimize AI energy consumption with Ubuntu 26.04 LTSNewAuto

Ubuntu 26.04 LTS focuses on reducing energy consumption for AI workloads, which is crucial for cost and sustainability.

Cheap GPUCPU only

Ubuntu 26.04 LTS is designed to maximize the value extracted from GPU clusters by focusing on energy efficiency, measured in tokens per watt (TpW). This metric helps CEOs and infrastructure teams manage the cost of AI workloads more effectively.

Try it

sudo apt-get install ubuntu-26.04-lts

Google introduces Gemma 4 for QAT modelsNewAuto

Optimizing model compression for mobile and laptop efficiency

QuantizationMobile and laptop GPUs

Google's Gemma 4 introduces advancements in quantization-aware training, which can significantly improve the efficiency of AI models on mobile and laptop devices. This is crucial for deploying models where computational resources are limited.

Try it

python train.py --quantize

NVIDIA Releases CUDA-Oxide for Rust-To-CUDA CompilerNewAuto

Experimental Rust-to-CUDA compiler for safer GPU kernel development

ToolingNVIDIA GPUs

CUDA-Oxide allows developers to write CUDA GPU kernels in Rust, providing a safer alternative to traditional CUDA C/C++ development. This can lead to more robust and maintainable GPU code.

Try it

rustc my_cuda_kernel.rs --crate-name cuda_oxide

Linux 7.2 can boot on Apple M3 devicesNewAuto

Linux 7.2 mainline kernel will support booting on Apple M3 devices, including iMac and MacBook.

Cheap GPUCPU only

This means that users with Apple M3 devices will be able to run Linux, although it may not be immediately useful for end-users due to ongoing development and compatibility issues.

Try it

sudo apt update && sudo apt upgrade -y && sudo apt install linux-image-7.2

Linux 7.2 boots on Apple M3 devicesNewAuto

Linux 7.2 mainline kernel will support booting on Apple M3 devices, including iMac and MacBook products.

Cheap GPUApple M3

This means that users with Apple M3 devices will be able to run Linux on their hardware, potentially improving the utility of these devices for users who prefer or require Linux.

Try it

sudo apt update && sudo apt upgrade -y && sudo apt install linux-image-7.2

Post-train autonomous vehicle models in closed-loop with NVIDIA AlpamayoNewAuto

Use NVIDIA Alpamayo to bridge the gap between training and deployment for AV policies

TrainingNVIDIA GPU

NVIDIA Alpamayo helps in post-training autonomous vehicle models in a closed-loop, which is crucial for developing effective AV policies. This tool can be used to fine-tune and validate models before deployment.

Try it

nvidia-alpamayo --train-model --input-data <data>

Optimize Kernel Tracing for Function ParametersNewAuto

Improve kernel tracing by considering unused function parameters.

ToolingCPU only

Optimizing compilers may remove unused function parameters, which can affect kernel tracing or BPF subsystems. To address this, consider the implications of such optimizations on tracing functionality and adjust accordingly.

Try it

# Example of optimizing kernel tracing
# This is a placeholder command as the actual implementation depends on the specific kernel function and tracing setup.
# Please refer to the source article for detailed instructions.
echo 'Optimize kernel tracing for unused parameters'

Use NVIDIA Vera CPU for agentic workloads in AI factoriesNewAuto

Leverage NVIDIA Vera CPU to handle agentic workloads in AI factories.

Cheap GPUNVIDIA Vera CPU

NVIDIA Vera CPU sets a new standard for agentic workloads by enabling AI factories to preprocess and analyze large datasets more efficiently, leading to improved AI model training and scaling.

Try it

# Example command for running agentic workloads on NVIDIA Vera CPU
# This is a placeholder command and may vary based on actual usage
nvidia_vera_run --task <task_name> --data <data_path>

Advance AI infrastructure with NVIDIA DOCA In-Silicon SecurityNewAuto

Utilize NVIDIA DOCA for enhancing AI infrastructure with in-silicon security.

ToolingNVIDIA DOCA

NVIDIA DOCA introduces a new class of infrastructure for AI, transforming data into intelligence for autonomous AI agents with unprecedented security and efficiency, making it ideal for building AI factories.

Try it

# Example command for using NVIDIA DOCA
# This is a placeholder command and may vary based on actual usage
doca_command --option <option_value>

Post-train AV models in closed-loop with NVIDIA AlpamayoNewAuto

Use NVIDIA Alpamayo for post-training AV models to bridge the gap between training and deployment.

TrainingNVIDIA Alpamayo

NVIDIA Alpamayo is designed to help developers post-train autonomous vehicle models in a closed-loop system, which is crucial for refining AV policies and ensuring they perform well in real-world scenarios.

Try it

# Example command for post-training with NVIDIA Alpamayo
# This is a placeholder command and may vary based on actual usage
alpamayo_post_train --model <model_path> --data <data_path>

Automate AI model documentation with NVIDIA MCG ToolkitNewAuto

Use NVIDIA MCG Toolkit to streamline AI model documentation.

ToolingAny NVIDIA GPU

The NVIDIA MCG Toolkit automates AI model documentation, which is crucial for regulatory compliance and understanding model complexity.

Try it

nvidia-mcg-toolkit --help

Automate AI model documentation with NVIDIA MCG ToolkitNewAuto

Use NVIDIA MCG Toolkit to automate AI model documentation for regulatory compliance.

ToolingN/A

As AI models grow in complexity, regulatory scrutiny intensifies. The NVIDIA MCG Toolkit automates AI model documentation, helping teams comply with frameworks like California's AB-2013 and the EU AI Act.

Try it

N/A

Fast Startup for Inference Workloads on Kubernetes with NVIDIA DynamoNewAuto

Use NVIDIA Dynamo for fast startup of inference workloads on Kubernetes.

InferenceAny NVIDIA GPU

NVIDIA Dynamo provides fast startup for inference workloads on Kubernetes, addressing the cold-start problem in production inference deployments where demand fluctuates over time.

Try it

kubectl apply -f nvidia-dynamo.yaml

Run Step 3.7 Flash on NVIDIA GPUs for multimodal AINewAuto

Deploy multimodal AI applications on NVIDIA GPUs using Step 3.7 Flash.

ToolingAny NVIDIA GPU

AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and text. Step 3.7 Flash on NVIDIA GPUs enables this.

Try it

nvidia-smi

Automate AI model documentation with NVIDIA MCG ToolkitNewAuto

Use NVIDIA MCG Toolkit to automate AI model documentation for regulatory compliance.

ToolingCPU only

The NVIDIA MCG Toolkit helps automate AI model documentation, which is crucial for regulatory scrutiny under frameworks like California’s AB-2013 and the EU AI Act.

Try it

nvidia-mcg --help

Use NVIDIA Dynamo for fast inference workloads on KubernetesNewAuto

Accelerate inference workloads on Kubernetes with NVIDIA Dynamo.

InferenceNVIDIA GPU

NVIDIA Dynamo provides fast startup for inference workloads on Kubernetes, addressing the cold-start problem in production inference deployments where demand fluctuates over time.

Try it

N/A

Leverage CUDA Python 1.0 for unified GPU programmingNewAuto

CUDA 13.3 introduces CUDA Python 1.0 for easier GPU programming

ToolingAny NVIDIA GPU

CUDA 13.3 includes the release of CUDA Python 1.0, which provides a unified GPU programming model. This allows developers to write Python code that can run on both CPUs and GPUs, simplifying the development of GPU-accelerated applications.

Try it

import cupy as cp

Use NVIDIA Dynamo for fast startup of inference workloads on KubernetesNewAuto

NVIDIA Dynamo reduces inference workload startup times on Kubernetes

InferenceAny NVIDIA GPU

NVIDIA Dynamo Snapshot is designed to address the cold-start problem in inference deployments by reducing the time it takes to spin up inference replicas. This can be particularly beneficial for fluctuating demand scenarios, allowing for more efficient scaling of inference services.

Try it

kubectl apply -f <your_inference_service>

CUDA Python 1.0 released for unified GPU programmingNewAuto

Simplify GPU programming with CUDA Python 1.0

ToolingNVIDIA GPU

CUDA 13.3 introduces CUDA Python 1.0, which provides a familiar Python-based interface for GPU programming, making it easier to develop and deploy GPU-accelerated applications across various domains.

Try it

import cupy as cp

NVIDIA Dynamo Snapshot for fast inference workloads on KubernetesNewAuto

Optimize inference workloads on Kubernetes for fluctuating demand

InferenceNVIDIA GPU

NVIDIA Dynamo Snapshot addresses the cold-start problem in production inference deployments by enabling fast startup for inference workloads on Kubernetes, allowing inference replicas to scale elastically and handle demand fluctuations efficiently.

Try it

kubectl apply -f <dynamo-snapshot-config>

CUDA 13.3 Enhances GPU Development with Tile Programming and AutotuningNewAuto

CUDA 13.3 introduces tile programming in C++, compiler autotuning, and Python updates.

ToolingAny NVIDIA GPU

NVIDIA CUDA 13.3 brings new capabilities and performance optimizations, including tile programming and compiler autotuning, benefiting developers across the CUDA ecosystem.

Try it

nvcc -tile my_kernel.cu

Develop High-Performance GPU Kernels with CUDA TileNewAuto

NVIDIA CUDA Tile allows optimizing GPU kernels using tile-based programming in C++.

ToolingAny NVIDIA GPU

NVIDIA CUDA Tile programming enables developers to write highly optimized GPU kernels within existing C++ codebases, improving performance.

Try it

#include <cuda_tile.h>
__global__ void my_kernel() {
  // Use tile-based programming constructs
}

Use NVIDIA CompileIQ for auto-tuning compiler optionsNewAuto

NVIDIA CompileIQ auto-tunes compiler options for optimal GPU performance.

ToolingAny NVIDIA GPU

NVIDIA CompileIQ tackles the problem of finding the best compiler options for a specific GPU, unlocking performance potential automatically.

Try it

nvcc -autotune option

Develop GPU kernels with NVIDIA CUDA TileNewAuto

NVIDIA CUDA Tile allows developing optimized GPU kernels using tile-based programming in C++.

ToolingAny NVIDIA GPU

NVIDIA CUDA Tile programming enables the development of highly optimized GPU kernels within existing C++ codebases, improving performance.

Try it

nvcc -tile my_kernel.cu

Using LLMs for reviewing kernel patchesNewAuto

Large language models are being used to review kernel patches.

ToolingCPU only

The kernel community has been exploring the use of large language models (LLMs) for reviewing kernel patches, which could potentially improve the efficiency and accuracy of the review process.

Try it

# Example command to use LLM for patch review (hypothetical)

AI assistance in Linux kernel developmentNewAuto

AI tools like GitHub Copilot and Claude Code are being used to generate or co-author patches for the Linux kernel.

ToolingCPU only

This indicates a growing trend in leveraging AI for software development, specifically in the context of kernel development, which can help improve efficiency and reduce the maintenance burden.

Try it

git apply < generated-patch-file.patch

AI Assistance in Generating Linux Kernel PatchesNewAuto

GitHub Copilot and Claude Code are being used to generate or co-author Linux kernel patches.

ToolingCPU only

This indicates an increasing trend in the use of AI for software development, particularly in the context of kernel development, which can help automate and speed up the process of fixing issues.

Try it

git apply < generated_patch_file.patch

Get real-time visibility into GPU usage across Kubernetes clustersNewAuto

Maximize AI infrastructure value with deep visibility into GPU utilization

ToolingAny GPU

NVIDIA's solution provides real-time visibility into GPU usage across Kubernetes clusters, which is essential for optimizing resource allocation and maximizing the value of AI infrastructure.

Try it

# Placeholder command, actual implementation depends on NVIDIA's monitoring tools

Synthesize realistic 3D medical images at scaleNewAuto

Use NVIDIA's method to generate high-quality 3D medical imaging data for radiology AI

TrainingRTX 3090 24GB

NVIDIA's method allows for the synthesis of realistic 3D medical images at scale, addressing data scarcity and privacy issues in radiology AI. This can be crucial for training AI models on diverse and representative datasets.

Try it

# Placeholder command, actual implementation depends on NVIDIA's tools and frameworks

Develop web apps with local LLM inferenceNewAuto

Use local inference to avoid costs and latency of metered AI APIs

InferenceCPU only

Canonical's approach to developing web apps with local LLM inference can help reduce costs and latency associated with metered AI APIs. This enables rapid iteration during development without incurring high costs.

Try it

llama-serve --model <model_path> --port <port>

Develop web apps with local LLM inference to save costsNewAuto

Use local LLM inference for web app development to avoid metered API costs and enable rapid iteration.

InferenceCPU only

Working with metered AI APIs can be costly and slow down development. By using local LLM inference, developers can iterate rapidly and avoid high costs associated with API calls. This approach is particularly beneficial for web app development where rapid iteration is crucial.

Try it

# Example Python code for local LLM inference
from transformers import pipeline

# Load the local LLM model
llm = pipeline('llm', model='local-model-name')

# Perform inference
result = llm("Your input text")
print(result)

Cache Aware Scheduling to improve Linux kernel performanceNewAuto

Improves performance by reducing cache misses

Cheap GPUCPU only

CONFIG_SCHED_CACHE has been merged into the mainline kernel, which should improve performance by reducing cache misses. This can be particularly beneficial for AI workloads running on CPU-only systems.

Try it

# CONFIG_SCHED_CACHE is enabled by default in Linux 7.2
# No specific command needed, just ensure your kernel is updated

Properly evaluate AI agents using agentic techniquesNewAuto

Distinguish between evaluating AI models and AI agents

TrainingCPU only

Evaluating an AI model and evaluating an AI agent are related but answer fundamentally different questions. A model benchmark tests the capability of a model, whereas an AI agent evaluation focuses on how well the agent performs in a specific environment or task.

Mastering Agentic Techniques for AI Agent EvaluationNewAuto

Understand the difference between evaluating AI models and AI agents

TrainingCPU only

Evaluating an AI model tests its capability, while evaluating an AI agent answers different questions. This distinction is crucial for developers to understand when assessing their AI systems.

Fine-tune NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video GenerationNewAuto

Use LoRA/DoRA for fine-tuning NVIDIA Cosmos Predict 2.5 for robot video generation tasks

TrainingRTX 3090 24GB

This blog post details how to fine-tune NVIDIA's Cosmos Predict 2.5 model using LoRA/DoRA for robot video generation tasks. Fine-tuning allows the model to adapt to specific use cases, improving its performance on tasks like video generation for robotics.

Try it

model = AutoModelForCausalLM.from_pretrained('nvidia/cosmos-predict-2.5')
tokenizer = AutoTokenizer.from_pretrained('nvidia/cosmos-predict-2.5')

with torch.no_grad():
    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits

Fine-tune NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video GenerationNewAuto

Use LoRA/DoRA to fine-tune NVIDIA Cosmos Predict 2.5 for improved robot video generation

TrainingRTX 3090 24GB

In this blog post, NVIDIA demonstrates how to fine-tune their Cosmos Predict 2.5 model using LoRA/DoRA for generating robot videos. This approach can potentially improve the quality and accuracy of generated videos, which is crucial for applications in robotics and autonomous systems.

Try it

python fine_tune.py --model cosmos-predict-2.5 --strategy lora-dora

Use AI fuzzing tools for uncovering Linux kernel bugsNewAuto

AI fuzzing tools can help identify bugs in the Linux kernel

ToolingCPU only

Linux's second-in-command Greg Kroah-Hartman has been leveraging new AI fuzzing tools for uncovering Linux kernel bugs, highlighting the effectiveness of AI in identifying critical issues in system software.

Try it

# Example command to run AI fuzzing tool (hypothetical)

Improve Linux GPU Drivers for Better Gaming ExperienceAuto

Valve is expanding their open-source Linux graphics driver team to enhance GPU drivers.

Cheap GPUCPU only

Valve has hired a leading Mesa developer from AMD to join their team, aiming to improve the Linux GPU drivers for a better gaming experience. This move signifies the importance of optimizing GPU drivers for better performance and compatibility on Linux systems.

Try it

sudo apt-get install mesa-utils

Enhancing Linux GPU Drivers for Better Gaming ExperienceAuto

Valve hires top Mesa developer from AMD to improve Linux GPU drivers.

Cheap GPURTX 3090 24GB

Valve continues to expand their open-source Linux graphics driver team, securing top talent to enhance the Linux GPU drivers for a better gaming experience, which can also benefit AI developers running GPU-intensive tasks.

Try it

sudo apt-get install mesa-utils

Linux Kernel Adds Documentation for Responsible AI UseAuto

New documentation in Linux 7.1 kernel focuses on responsible AI use for finding kernel bugs.

ToolingCPU only

This documentation provides guidelines on what qualifies as a security bug and how AI can be used responsibly to identify kernel bugs, which can help developers improve the security and reliability of their systems.

Try it

cat /usr/src/linux-7.1/Documentation/admin-guide/ai-bugs.rst

Document what qualifies as a security bug in AIAuto

Linux 7.1 kernel adds documentation for responsible AI use in finding kernel bugs.

ToolingCPU only

This documentation helps developers understand the criteria for what constitutes a security bug when using AI to find kernel bugs, promoting responsible AI practices.

Try it

cat /usr/src/linux-7.1/Documentation/security/responsible-ai-use.rst

Solving Agentic AI's Scale-Up Problem with NVIDIA Vera Rubin PlatformAuto

NVIDIA Vera Rubin platform addresses the scale-up problem in agentic AI inference workloads.

Cheap GPUNVIDIA Vera Rubin

Agentic inference has fundamentally changed the runtime dynamics of inference workloads by introducing non-deterministic trajectories. NVIDIA Vera Rubin platform is designed to solve the scale-up problem in agentic AI, enabling efficient inference on large models.

Try it

# Example command to run agentic AI inference on NVIDIA Vera Rubin
nvidia-smi -i 0 --gpu=0 --compute-mode=exclusive_process --threads=1 --mig=1g.1g.1g.1g.1g.1g.1g.1g

Arm Mali G1 Pro support in PanVK and Panfrost driversAuto

PanVK Vulkan driver and Panfrost Gallium3D driver now support Arm Mali G1-Pro GPU hardware.

Cheap GPUArm Mali G1-Pro

This support enables AI developers to utilize Arm Mali G1-Pro GPUs with open-source drivers, expanding the range of affordable hardware options for AI development.

Try it

git clone https://github.com/panfrost-driver/panfrost && cd panfrost && ./configure && make && sudo make install

Improved support for older AMD GPUs on LinuxAuto

Valve's Linux open-source graphics driver team enhances aging AMD GCN 1.0/1.1 era graphics cards.

Cheap GPUOlder AMD GCN 1.0/1.1 GPUs

This improvement allows for better utilization of older AMD GPUs on Linux, potentially enabling AI developers to run models on more affordable hardware.

Try it

sudo apt-get install mesa-driver

Arm Mali G1 Pro support in open-source PanVK & Panfrost driversAuto

PanVK Vulkan driver and Panfrost Gallium3D driver now support Arm Mali G1-Pro GPU hardware.

Cheap GPUArm Mali G1-Pro

This support enables AI development on devices with Arm Mali G1-Pro GPUs, which are typically found in lower-cost or embedded systems.

Try it

git clone https://github.com/panfrost-driver/panfrost && cd panfrost && ./configure && make && sudo make install

Learn from Parameter Golf AI-assisted research techniquesAuto

Explore AI-assisted machine learning research and model design

TrainingCPU only

Parameter Golf event gathered 1,000+ participants to explore AI-assisted research, coding agents, quantization, and novel model design under strict constraints.

Try it

# Placeholder for AI-assisted research commands

Optimize AI model serving by reducing pipeline frictionAuto

Improve model serving efficiency by optimizing the pipeline

InferenceRTX 3090 24GB

Invest weeks in fine-tuning models and discover exporting to production is not smooth. Use NVIDIA's techniques to optimize the serving pipeline and reduce friction.

Try it

# NVIDIA's model optimization and serving tips
# Placeholder for specific commands

Optimize GPU fleet with NVIDIA Fleet IntelligenceAuto

NVIDIA Fleet Intelligence provides real-time visibility and optimization for GPU fleets

ToolingNVIDIA GPUs

NVIDIA Fleet Intelligence can help maximize the utilization and efficiency of large GPU fleets. It provides insights into GPU usage and performance, enabling better resource allocation and optimization.

Try it

nvidia-fleet-intelligence --help

Utilize AWS for foundation model training and inferenceAuto

AWS provides building blocks for training and inference of foundation models

TrainingAWS GPU instances

AWS offers various services and tools that can be used to train and deploy foundation models efficiently. These services can help manage the complexity of large-scale model training and inference.

Try it

aws s3 sync s3://my-bucket/path/to/model /path/to/local/model

Use NVIDIA Fleet Intelligence for GPU fleet optimizationAuto

NVIDIA Fleet Intelligence provides real-time visibility and optimization for GPU fleets

ToolingNVIDIA GPUs

NVIDIA Fleet Intelligence offers tools for monitoring and optimizing large GPU fleets, enabling efficient resource utilization and performance.

Try it

nvidia-fleet-intelligence --query --update

Improve Linux per-core I/O performance with new patchesAuto

Jens Axboe is working on Linux patches to significantly improve per-core I/O performance.

ToolingCPU only

Following a presentation at the Linux storage, file-system, memory management and BPF summit, Jens Axboe was motivated to improve Linux I/O overhead compared to the Storage Performance Development Kit (SPDK), aiming for a 60% increase in per-core I/O performance.

Try it

# Apply the patches to the Linux kernel
git apply /path/to/axboe-patches

Utilize NVIDIA Dynamo for structured agentic exchangesAuto

NVIDIA Dynamo supports multi-turn agentic harness for preserving structured interactions.

ToolingAny NVIDIA GPU

NVIDIA Dynamo introduces multi-turn agentic harness support, allowing assistant turns to interleave reasoning with tool calls, and user turns to return responses. This is beneficial for applications requiring structured interactions and can be implemented on any NVIDIA GPU.

Try it

nvidia_dynamo_agentic_exchange

NVIDIA Releases CUDA-Oxide for Rust-To-CUDA CompilationAuto

Enables developers to use Rust for developing CUDA kernels for NVIDIA GPUs.

ToolingAny NVIDIA GPU

NVIDIA Labs project CUDA-Oxide 0.1 allows Rust to be used for developing CUDA kernels, potentially improving developer productivity and kernel performance.

Try it

cargo install cuda-oxide

Monitor real-time performance and debug with NCCL Inspector and PrometheusAuto

Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). NCCL Inspector and Prometheus can be used for real-time performance monitoring and faster debugging.

ToolingAny NVIDIA GPU

NCCL Inspector is a tool that can be used to monitor the performance of NCCL in real-time. It can help identify bottlenecks and issues in the communication between GPUs, leading to faster debugging and optimization of distributed deep learning workloads.

Try it

ncclInspector --log /path/to/logfile

Reduce VRAM usage and improve inference performance with NVIDIA Model OptimizerAuto

Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs.

QuantizationGeForce RTX GPUs

NVIDIA Model Optimizer supports post-training quantization, which can reduce VRAM usage and improve inference performance on consumer devices. This can be particularly useful for deploying models on devices with limited resources.

Try it

mo --input_model <model>.onnx --output_model <quantized_model>.onnx --data_type <INT8/FP16>

Monitor GPU-to-GPU communication with NCCL Inspector and PrometheusAuto

Use NCCL Inspector and Prometheus for real-time performance monitoring and faster debugging in distributed deep learning

ToolingNVIDIA GPUs

Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). NCCL Inspector and Prometheus provide real-time performance monitoring and faster debugging.

Try it

ncclInspector --log <log_file>

Reduce VRAM usage and improve inference performance with model quantizationAuto

Use NVIDIA Model Optimizer for post-training quantization to optimize models for consumer devices

QuantizationNVIDIA GeForce RTX GPUs

Model quantization reduces VRAM usage and improves inference performance on consumer devices such as NVIDIA GeForce RTX GPUs.

Try it

mo --input_model <model> --output_model <quantized_model> --data_type <INT8/FP16>

Speed up Unreal Engine NNE inference with NVIDIA TensorRTAuto

Utilize NVIDIA TensorRT to accelerate neural network inference in Unreal Engine

InferenceRTX 3090 24GB

NVIDIA TensorRT is used to speed up neural network execution in Unreal Engine, which can be beneficial for applications requiring real-time inference, such as in gaming or computer graphics.

Try it

# Example command to integrate TensorRT with Unreal Engine NNE inference
# Please refer to NVIDIA's documentation for specific implementation details
# This is a placeholder and not a direct command
tensorrt_command

Speed up Unreal Engine NNE inference with NVIDIA TensorRTAuto

NVIDIA TensorRT optimizes neural network inference for RTX runtime in Unreal Engine

InferenceRTX 3090 24GB

NVIDIA's TensorRT can be used to accelerate neural network inference in Unreal Engine, particularly for RTX runtime. This can significantly boost image quality and performance in computer graphics applications, making it suitable for high-performance graphics tasks.

Try it

# Example: Using NVIDIA TensorRT with Unreal Engine NNE
# This is a conceptual representation and not actual code
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30
engine = builder.build_engine(network, config)

Use multi-token prediction for faster inference in Gemma 4Auto

Multi-token prediction can speed up inference by predicting multiple tokens at once

InferenceCPU only

Google's Gemma 4 uses multi-token prediction to accelerate inference. This approach predicts multiple tokens at once, which can significantly speed up inference times, especially for models that generate sequences of tokens.

Try it

# Example: Multi-token prediction in Gemma 4
# This is a conceptual representation and not actual code
multi_token_prediction = True
inference_speedup = perform_inference(model, multi_token_prediction)

Speed up Unreal Engine NNE inference with NVIDIA TensorRTAuto

Use NVIDIA TensorRT to accelerate neural network inference in Unreal Engine

InferenceRTX 3090 24GB

NVIDIA TensorRT is used to speed up neural network inference in Unreal Engine, which can be beneficial for applications requiring real-time image processing and graphics rendering.

Try it

# Example command to use TensorRT for inference
# This is a placeholder and actual command may vary based on specific use case
./inference_engine --model=model.trt --input=input_data

Utilize ROCm 7.2.3 for AMD GPU compute and AI stack enhancementsAuto

ROCm 7.2.3 offers minor improvements for AMD GPU compute and AI stack.

ToolingAMD GPUs

ROCm 7.2.3 provides minor updates to the open-source AMD GPU compute and AI stack, which can be beneficial for developers working on AI applications that leverage AMD GPUs.

Try it

module load rocm/7.2.3

Utilize ROCm 7.2.3 for AMD GPU compute and AI stack enhancementsAuto

This release includes minor improvements to the open-source AMD GPU compute and AI stack.

ToolingAMD GPUs

ROCm 7.2.3 offers minor updates to the open-source AMD GPU compute and AI stack, which can be beneficial for developers working with AMD GPUs to enhance their AI applications.

Try it

# Example command to install ROCm 7.2.3
sudo apt install rocm-dkms rocm-utils

GCC 16 compiler delivers performance gains over GCC 15Auto

GCC 16.1 compiler shows performance improvements over GCC 15

ToolingCPU only

The GCC 16.1 compiler has been released with new changes and performance gains over GCC 15, which can benefit AI development by improving compilation times and efficiency.

Try it

gcc --version

AI tooling influences an increase in kernel patchesAuto

The 7.1 kernel prepatch suggests AI tooling is contributing to more patches than usual.

ToolingN/A

The article mentions that the 7.1 kernel prepatch has more patches than usual, likely due to AI tooling. This implies that AI tooling is becoming more prevalent in kernel development, which could lead to more efficient and effective kernel patches.

Try it

N/A

Leverage AMD GAIA for building local AI agentsAuto

AMD GAIA simplifies building AI agents on your PC with local AI processing.

ToolingCPU only

AMD GAIA (Generative AI Is Awesome) is an open-source software that leverages the Lemonade SDK to facilitate the creation of AI agents on Windows and Linux PCs. This can be beneficial for developers looking to build AI applications that run locally without relying on cloud-based services.

Try it

git clone https://github.com/AMD-GAIA/gaia && cd gaia && ./build.sh

Use AMD's GAIA for local AI development on Windows and LinuxAuto

AMD's GAIA simplifies building AI agents on PCs with local AI processing.

ToolingCPU only

AMD's GAIA (Generative AI Is Awesome) is an open-source software that leverages the Lemonade SDK to facilitate the creation of AI agents on PCs with local AI processing capabilities. This can be beneficial for developers looking to build AI applications without relying on cloud-based services, enhancing privacy and reducing latency.

Try it

git clone https://github.com/AMD-GAIA/GAIA.git && cd GAIA && ./build.sh

Speed up Unreal Engine NNE inference with NVIDIA TensorRTAuto

NVIDIA TensorRT can accelerate neural network inference in Unreal Engine

InferenceRTX 30 series

NVIDIA TensorRT is used to speed up neural network inference in Unreal Engine, which can be beneficial for applications that require real-time graphics processing. The blog post suggests that using TensorRT can lead to significant performance improvements for tasks such as image quality enhancement and performance optimization.

Try it

# Example command to integrate TensorRT with Unreal Engine NNE
# Please refer to NVIDIA's documentation for specific implementation details
# https://developer.nvidia.com/blog/speed-up-unreal-engine-nne-inference-with-nvidia-tensorrt-for-rtx-runtime/

Automate GPU kernel translation with AI agentsAuto

Use AI to translate GPU kernels from Python to other languages, such as Julia, to expand the reach of your GPU-accelerated code.

ToolingAny GPU

NVIDIA's blog post discusses the use of AI agents to automate the translation of GPU kernels from Python to Julia, showcasing the potential for AI to assist in code portability and expansion across different programming environments.

Try it

# Example command to translate cuTile Python to cuTile.jl
# This is a conceptual representation and may not be directly executable
aitool translate --input cuTile.py --output cuTile.jl

Utilize NVIDIA CUDA Tile for tile-based GPU programmingAuto

NVIDIA CUDA Tile (cuTile) enables writing GPU kernels in terms of tile-level operations for better performance.

ToolingAny NVIDIA GPU

cuTile is a tile-based programming model that helps developers write GPU kernels focusing on tile-level operations such as loads, stores, and computations. This can lead to better performance and efficiency when programming for GPUs.

Try it

# Example of using cuTile for GPU kernel development
# This is a conceptual representation and not actual code
from cutile import Tile

# Define tile dimensions
tile = Tile(16, 16)

# Perform tile-based operations
# ...

Accelerate Page Migration for Better PerformanceAuto

AMD engineers are working on patches for accelerating page migration in the Linux kernel.

Cheap GPUCPU only

This patch series, originally started by a NVIDIA engineer in early 2025, aims to improve system performance by accelerating page migration. AMD's involvement suggests that this optimization could benefit a wide range of systems, not just those with AMD hardware.

Try it

git apply amd_page_migration.patch

Accelerate page migration for better performanceAuto

AMD engineers are working on patches to accelerate page migration for improved performance.

Cheap GPUCPU only

This patch series, originally started by a NVIDIA engineer, is now being worked on by AMD to accelerate page migration, which can lead to better performance in Linux systems.

Try it

git apply accelerated-page-migration.patch