AI Tips
Practical ways to train, run, or shrink AI models — explained for people new to AI. 52 new in last 30d.
New here? Each card answers one question: what is this and why should I care? Click a card to read the full explanation, including any new words. The command at the bottom is what you would type to try it on your own machine.
The AI flow — where each tip fits
Read left to rightAn AI model goes through these five phases. Click a phase to see the tips that apply there.
- 1Pre-training
A model first learns language by reading huge amounts of text. This costs millions of dollars and runs on thousands of GPUs.
- 2Fine-tuning
You take that pre-trained model and teach it your own data, your own task, or your own writing style. Hours to days, on a few GPUs.
e.g. QLoRA · Unsloth · DeepSpeed ZeRO-3 / FSDP
See Training tips → - 3Preference tuning
After fine-tuning, you teach the model which answers humans prefer. This makes it polite, helpful, and on-topic.
e.g. DPO / GRPO / KTO
See Training tips → - 4Quantization
The trained model is huge. Quantization shrinks it about 4× by storing its numbers with less precision, so it fits on cheap hardware.
e.g. GGUF + llama.cpp · AWQ / GPTQ · EXL2
See Quantization tips → - 5Inference / serving
Running the model so users can ask it questions. This is what your app actually does in production.
e.g. vLLM (PagedAttention) · Ollama · Speculative decoding
See Inference tips →
Pick the right serving stack — vLLM vs TGI vs AphroditeThree popular ways to host a model behind an API. vLLM is the safe default. TGI fits Hugging Face shops. Aphrodite is for heavy sampling.
Inferenceproduction GPU node
A serving stack is the program that turns a model into a web API. vLLM has the best throughput and an OpenAI-compatible API — it should be the default for new deployments. TGI (Text Generation Inference) is Hugging Face's, integrates nicely with their Inference Endpoints, and supports speculative decoding out of the box. Aphrodite is a vLLM fork tuned for high-temperature sampling, multi-LoRA, and role-play workloads. If you are unsure: start with vLLM.
Try it
# vLLM (most teams)
vllm serve mistralai/Mistral-Nemo-Instruct-2407
# TGI (HF infra)
docker run --gpus all ghcr.io/huggingface/text-generation-inference --model-id mistralai/Mistral-Nemo-Instruct-2407Liger Kernel — about 20% less training VRAM, one import lineReplaces a few slow parts of the model with faster code. You add one line and your training uses 20% less GPU memory.
Toolingany training rig
When a model trains, certain math operations create big temporary tensors (RMSNorm, SwiGLU, RoPE, the loss function). Liger Kernel — open-sourced by LinkedIn — rewrites these in Triton so the temporaries never get created. The result: about 20% less VRAM on Llama-3 training, up to 30% less on long-context runs, no change in accuracy. Works alongside FSDP and LoRA. One import line and you are done.
Try it
pip install liger-kernel && python -c "from liger_kernel.transformers import apply_liger_kernel_to_llama; apply_liger_kernel_to_llama()"DPO / GRPO / KTO — teach a model what good looks likeModern ways to use human feedback to make a model prefer good answers over bad ones. Simpler than the old RLHF setup.
Training1× A100 80GB or QLoRA on 24GB
Classic RLHF (the method behind ChatGPT) trains a separate reward model first, then runs reinforcement learning. It works but it is complicated. DPO (Direct Preference Optimization) skips the reward model — you just give it pairs of (good answer, bad answer) and it directly adjusts the model. GRPO scales this to math and reasoning where you can verify correctness automatically (DeepSeek used it for their math model). KTO needs only single labels (was this answer good? yes/no), so you can use cheap data like in-app thumbs-up/down. All three are in the trl library.
Try it
pip install trl && python -m trl.scripts.dpo --model meta-llama/Llama-3.1-8B --dataset HuggingFaceH4/ultrafeedback_binarizedMLX — run big models on a MacApple's framework that uses the Mac's shared memory. A 64 GB MacBook Pro can run a Llama-3 70B at usable speed.
Cheap GPUM2 / M3 / M4 Max 64GB+
On a normal PC the GPU has its own memory (VRAM) and the CPU has its own memory (RAM), and they have to copy data between them — that copy is slow. Apple Silicon Macs share one memory pool between CPU and GPU, so there is no copy. MLX is Apple's framework that takes advantage of this. A 64 GB M3 Max runs a 4-bit Llama-3 70B at about 10 tokens per second, which is usable for chat. The mlx-community on Hugging Face mirrors popular models pre-quantized for you.
Try it
pip install mlx-lm && mlx_lm.generate --model mlx-community/Llama-3.1-70B-Instruct-4bit --prompt 'hello'Unsloth — 2× faster LoRA fine-tuning, half the VRAMA drop-in library that makes fine-tuning twice as fast and uses half the GPU memory. Same result, less waiting.
TrainingRTX 3090 / 4090 24GB
Hugging Face's PEFT library is fine, but it is written in pure PyTorch which leaves performance on the table. Unsloth rewrites the LoRA forward and backward passes in Triton (NVIDIA's fast-kernel language). Result: the same loss curves, about 2× faster, about 50% less VRAM. Drop-in with Llama, Mistral, Phi, Gemma, Qwen — change a couple of import lines and it works. Their notebooks are a good starting point if you have never fine-tuned.
Try it
pip install unsloth && python -m unsloth.examples.llama3_8b_finetuneDeepSpeed ZeRO-3 / PyTorch FSDP — train models too big for one GPUSplits a giant model across many GPUs during training. The way teams fine-tune 70B+ models on 8 cards.
Training8× A100 80GB or H100s
When you train a model, you also need memory for gradients, optimizer state, and activations — together about 4× the model itself. ZeRO is a method that shards (splits) those across all your GPUs so each card only stores a slice. ZeRO-3 also shards the model parameters themselves. Combined with mixed-precision (bf16) training and activation checkpointing, this lets a normal 8× A100 box train a 70B model. PyTorch FSDP is the in-tree alternative with the same idea.
Try it
deepspeed --num_gpus 8 train.py --deepspeed ds_config_zero3.jsonOllama — one command to run any model on your laptopLike 'docker run' but for AI models. Auto-downloads, picks a good size for your machine, and exposes a local API.
ToolingMac / any GPU
Ollama wraps llama.cpp in a clean command-line tool plus a local HTTP API on port 11434 that speaks the OpenAI format. You type one command and it downloads the model, picks a quantized size that fits your machine, and starts serving. Perfect for testing prototypes, building agents, and anything on a Mac with Apple Silicon. The Modelfile lets you bake in a system prompt so the model behaves the way you want every time.
Try it
curl -fsSL https://ollama.com/install.sh | sh && ollama run llama3.1:70b-instruct-q4_K_MMixture of Experts (MoE) — huge model, fast model latencyThe model is a team of small experts. For each word it only uses two of them. So a 141B model answers as fast as a 39B one.
Techniquedepends on top-k
A normal ('dense') model uses all of its parameters for every word. An MoE replaces one block with several smaller experts, plus a router that picks the top-k experts (usually 2) per word. So Mixtral-8x22B has 141B total parameters but only ~39B are active per word — answers come at the speed of a 39B model. The trade-off: you still have to load all experts in GPU memory, so VRAM is high. DeepSeek-V3 and Mixtral are the well-known open MoEs.
Try it
vllm serve mistralai/Mixtral-8x22B-Instruct-v0.1 --tensor-parallel-size 4FlashAttention 3 — about 2× faster attention on H100 / BlackwellA faster version of the most expensive math step inside an AI model. Free speed-up on the newest NVIDIA cards.
ToolingH100 / B100 / B200 / 5090
Attention is the math step that lets the model look at every word it has read so far. Done naively it reads the GPU's slow memory many times per word. FlashAttention rewrites the same math so it streams data through the GPU's tiny fast memory once. Version 3 adds support for FP8 (8-bit math) on H100 and Blackwell GPUs, which doubles throughput compared to FP16. PyTorch 2.5 picks it up automatically when you call scaled_dot_product_attention on supported hardware. The biggest free perf win you can flip on a transformer.
Try it
pip install flash-attn --no-build-isolationSpeculative decoding — make the big model faster, for freeA tiny model guesses the next words. The big model just checks the guesses in one batch. 2–3× faster, same answer quality.
Techniqueany inference target
Big models are slow because they generate one word (or token) at a time. With speculative decoding you also load a small, cheap 'draft' model. The draft model writes the next 5 tokens. The big model then runs once and checks all 5 in parallel. Tokens it agrees with are kept; tokens it does not agree with are replaced. The final answer is identical to what the big model would have written alone — you just spent fewer expensive runs to get there. Built into vLLM, TGI, and llama.cpp.
Try it
vllm serve meta-llama/Llama-3.1-70B --speculative-model meta-llama/Llama-3.2-1B --num-speculative-tokens 5vLLM — serve a model 5–10× faster than the basic libraryA serving engine that handles many requests at the same time without wasting GPU memory. The default choice for production.
Inference1× A100 / H100 / 4090
When a model answers, it has to remember what it has already written — this is called the KV cache and it eats a lot of GPU memory. The basic Hugging Face library reserves a fixed chunk for each request, which leaves a lot of memory unused. vLLM borrows the idea of virtual memory from operating systems: it splits the cache into small pages and reuses them, so the same GPU can serve 5–10× more requests at the same time. It also exposes an OpenAI-compatible HTTP API, so any client that talks to OpenAI can talk to vLLM with one URL change.
Try it
pip install vllm && vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 1EXL2 — the fastest way to run shrunk models on a gaming GPUPicks the best precision for each part of the model. About 2× faster than the GGUF format on a single RTX card.
InferenceRTX 3090 / 4090 / 5090
EXL2 is a quantization format made for ExLlamaV2, a runtime built only for NVIDIA GPUs (no CPU offload). It is smarter than the others because it spends more bits on the important parts of the model and fewer bits on the rest, hitting an average target like 4.0 bits per weight. On one RTX 4090 a Llama-3 70B in EXL2 4.0bpw runs at about 20 tokens per second — roughly twice the speed of an equivalent GGUF setup. Use this when you need pure speed on a single card.
Try it
pip install exllamav2 && python examples/inference.py -m models/llama-3-70b-exl2-4.0bpw -p 'hello'AWQ / GPTQ — shrink a model 4× without losing qualitySmarter compression that picks which numbers matter most and keeps those accurate. The model takes a quarter of the memory but answers almost the same.
Quantizationany 16GB+ GPU
Quantization (storing numbers in 4-bit instead of 16-bit) usually loses some quality. AWQ — short for Activation-aware Weight Quantization — figures out which channels in the model are most important and protects them. The result: 4-bit AWQ keeps about 99% of the quality of the full model, but uses a quarter of the memory. GPTQ is the older method and slightly weaker. AWQ also runs faster on modern NVIDIA cards because vLLM and TGI ship optimized code for it.
Try it
python -c "from awq import AutoAWQForCausalLM; m = AutoAWQForCausalLM.from_quantized('casperhansen/llama-3-70b-awq')"GGUF + llama.cpp — run a 70B model on a gaming PCA file format that packs a model so it can split between your GPU and your normal computer memory. Lets a 70B model run on a 4090.
QuantizationRTX 4090 24GB (with offload)
GGUF is the file format llama.cpp uses to ship pre-shrunk (quantized) models. Quantization means storing the model's numbers with less precision so the file is smaller — Q4_K_M is the popular setting, about 4× smaller than the original. The clever part is layer offload: you tell llama.cpp how many layers to keep on the GPU (-ngl 35) and the rest stay in your normal computer RAM. So a 70B model that needs ~40 GB can split: 24 GB on a 4090 + 16 GB in system RAM. About 8 tokens per second on a 4090 with 64 GB DDR5.
Try it
llama-cli -m llama-3.1-70b-q4_k_m.gguf -ngl 35 -c 8192 -p "hello"QLoRA — fine-tune a 70B model on one consumer GPUTeach a giant model new skills using only 24 GB of GPU memory, instead of the 320 GB you would normally need.
TrainingRTX 3090 / 4090 24GB
Fine-tuning means starting from an already-trained model and teaching it your own data. A 70B-parameter model normally needs about 320 GB of GPU memory (VRAM) to fine-tune — that costs thousands per hour in the cloud. QLoRA does two clever things: it stores the original model in 4-bit numbers (about 4× smaller), and it only trains a tiny add-on called a LoRA adapter, not the whole model. The full setup fits on a single 24 GB gaming card with no real loss in quality. Result: home-lab fine-tuning is suddenly possible.
Try it
pip install bitsandbytes peft transformers && python -m peft.examples.qlora --model meta-llama/Llama-3.1-70B --bits 4Boost Mixture-of-Experts training throughput with advanced fusion kernelsNewAutoIncrease training throughput for Mixture-of-Experts models using advanced fusion kernels
TrainingAny GPU
Advanced fusion kernels can significantly boost the training throughput of Mixture-of-Experts models, which are a key component in large-scale AI systems, by optimizing the communication between experts.
Try it
moe_model = MixtureOfExpertsModel()
advanced_fusion_kernels.optimize_training_throughput(moe_model)Fine-tune biological foundation models with LoRA using NVIDIA BioNeMoNewAutoUse NVIDIA BioNeMo recipes to fine-tune biological foundation models with LoRA
TrainingAny GPU
NVIDIA BioNeMo provides recipes for fine-tuning biological foundation models with LoRA, allowing for efficient and effective updates to these large models in the field of computational biology.
Try it
nvidia_bionemo_recipes = NVIDIABioNeMoRecipes()
lora_finetuned_model = nvidia_bionemo_recipes.fine_tune_biological_model(model, data)Boost Mixture-of-Experts training throughput with advanced fusion kernelsNewAutoIncrease MoE model training throughput using advanced fusion kernels
TrainingRTX 3090 24GB
NVIDIA's blog post discusses how to boost the training throughput of Mixture-of-Experts (MoE) models by using advanced fusion kernels, which can significantly improve the efficiency of training large-scale AI systems.
Try it
python -m moe_train --advanced-fusion-kernelsFine-tune biological foundation models with LoRA using NVIDIA BioNeMoNewAutoUse NVIDIA BioNeMo recipes to fine-tune foundation models with LoRA for computational biology tasks
TrainingRTX 3090 24GB
NVIDIA BioNeMo provides recipes for fine-tuning large foundation models like ESM2 using LoRA, allowing for efficient and effective updates to these models for specific computational biology tasks.
Try it
python -m biodemo.lora_finetune --config-file config.yamlSimplify AI infrastructure at the edge with Cisco and CanonicalNewAutoTest-time inference is shifting to the edge to reduce latency and bandwidth consumption.
Cheap GPUCPU only
Legacy infrastructure was not designed for AI era requirements. Large-scale model training remains centralized in data centers, while test-time inference is moving to the edge to reduce latency and bandwidth consumption.
Try it
# Example command to deploy AI model on edge device
sudo docker run -d --name ai-edge-model my-ai-model:latestShift test-time inference to edge to reduce latency and bandwidthNewAutoMove inference from centralized data centers to edge devices to improve performance
InferenceCPU only
Legacy infrastructure was not designed for AI requirements. Test-time inference is rapidly shifting to the edge to reduce latency and bandwidth consumption. This shift can significantly improve performance for applications requiring real-time responses.
Try it
sudo apt-get install edge-device-softwareSimplify AI infrastructure at the edge with Cisco and CanonicalNewAutoTest-time inference is shifting to the edge to reduce latency and bandwidth consumption
InferenceCPU only
Legacy infrastructure was not designed for AI era requirements. Large-scale model training remains centralized in data centers, but test-time inference is moving to the edge to reduce latency and bandwidth consumption.
Try it
# Example command to deploy AI model at the edge
sudo apt-get install -y edge-ai-model
edge-ai-model --deployOptimize AI inference at the edge to reduce latency and bandwidthNewAutoShifting test-time inference to the edge can enhance performance
InferenceCPU only
As large-scale model training remains centralized in data centers, test-time inference is rapidly shifting to the edge to reduce latency and bandwidth consumption, which is crucial for real-time AI applications.
Try it
# Example command to deploy model inference at the edge
# This is a placeholder and actual command depends on the specific edge device and software stack
edge_deploy_model.shOptimize AI infrastructure for edge deploymentNewAutoShift test-time inference to the edge to reduce latency and bandwidth consumption
Cheap GPUCPU only
Legacy infrastructure was not designed for AI requirements. Large-scale model training remains centralized in data centers, but test-time inference is rapidly shifting to the edge. This can help reduce latency and bandwidth consumption, which is crucial for real-time AI applications.
Try it
sudo apt-get install -y edge-ai-optimization-toolConvert FP8 Checkpoints to NVIDIA TensorRT for High-Performance InferenceNewAutoBridge the gap between model optimization and production deployment
QuantizationNVIDIA GPU
This process enables faster inference by converting a quantized checkpoint into an NVIDIA TensorRT engine, which is crucial for deploying optimized models in production.
Try it
tensorrt_script = 'python post_training_quantization/convert_to_tensorrt.py --fp8_checkpoint /path/to/fp8/checkpoint --output /path/to/output/engine'Convert FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRTNewAutoBridging the gap between model optimization and production deployment
QuantizationNVIDIA GPU
Converting a quantized checkpoint into an NVIDIA TensorRT engine enables faster inference, improving model performance and reducing latency.
Try it
tensorrt_script = 'python -m torch.ao.quantization.fx.prepare_qat_fx ' + '-m model_fp32 ' + '-m model_qat'Train models faster with JAX and MaxText using NVFP4 on NVIDIA BlackwellNewAutoImprove throughput when training large language models with JAX and MaxText on NVIDIA Blackwell
TrainingNVIDIA Blackwell
When training spans trillions of tokens across thousands of accelerators, every percentage point of step improvement matters. Using JAX and MaxText with NVFP4 on NVIDIA Blackwell can help achieve better throughput, which is crucial for pre-training frontier LLMs.
Try it
# Example command using JAX and MaxText
# This is a placeholder command and should be replaced with the actual usage
jax.run(your_model_training_function)Train models faster with JAX and MaxText using NVFP4 on NVIDIA BlackwellNewAutoImprove throughput when training large language models with JAX and MaxText on NVIDIA Blackwell
TrainingNVIDIA Blackwell
When training spans trillions of tokens across thousands of accelerators, every percentage point of step improvement matters. Using JAX with MaxText and NVFP4 on NVIDIA Blackwell can significantly improve throughput, leading to faster training times for large language models.
Try it
jax.run(your_model, your_data, max_text=True, nvfp4=True)GPU-accelerated video decoding with Vulkan in FirefoxNewAutoFirefox now supports Vulkan Video for improved video decoding performance
InferenceVulkan compatible GPU
Mozilla Firefox has merged initial support for Vulkan Video, which allows for GPU-accelerated video decoding. This can lead to performance improvements and reduced CPU usage when playing videos in the browser, especially on systems with Vulkan-compatible GPUs.
Try it
# Enable Vulkan Video decoding in Firefox
# This may require enabling specific settings or flags in the browserVulkan Video support in Firefox for GPU-accelerated video decodingNewAutoMozilla Firefox now supports Vulkan Video for GPU-accelerated video decoding
ToolingVulkan compatible GPU
This feature allows Firefox to leverage the Vulkan API for video decoding, potentially improving performance and efficiency on systems with Vulkan-compatible GPUs.
Try it
firefox --enable-features=VulkanVideoDecodeOptimize AI energy consumption with Ubuntu 26.04 LTSNewAutoUbuntu 26.04 LTS focuses on reducing energy consumption for AI workloads, which is crucial for cost and sustainability.
Cheap GPUCPU only
Ubuntu 26.04 LTS is designed to maximize the value extracted from GPU clusters by focusing on energy efficiency, measured in tokens per watt (TpW). This metric helps CEOs and infrastructure teams manage the cost of AI workloads more effectively.
Try it
sudo apt-get install ubuntu-26.04-ltsGoogle introduces Gemma 4 for QAT modelsNewAutoOptimizing model compression for mobile and laptop efficiency
QuantizationMobile and laptop GPUs
Google's Gemma 4 introduces advancements in quantization-aware training, which can significantly improve the efficiency of AI models on mobile and laptop devices. This is crucial for deploying models where computational resources are limited.
Try it
python train.py --quantizeNVIDIA Releases CUDA-Oxide for Rust-To-CUDA CompilerNewAutoExperimental Rust-to-CUDA compiler for safer GPU kernel development
ToolingNVIDIA GPUs
CUDA-Oxide allows developers to write CUDA GPU kernels in Rust, providing a safer alternative to traditional CUDA C/C++ development. This can lead to more robust and maintainable GPU code.
Try it
rustc my_cuda_kernel.rs --crate-name cuda_oxideLinux 7.2 can boot on Apple M3 devicesNewAutoLinux 7.2 mainline kernel will support booting on Apple M3 devices, including iMac and MacBook.
Cheap GPUCPU only
This means that users with Apple M3 devices will be able to run Linux, although it may not be immediately useful for end-users due to ongoing development and compatibility issues.
Try it
sudo apt update && sudo apt upgrade -y && sudo apt install linux-image-7.2Linux 7.2 boots on Apple M3 devicesNewAutoLinux 7.2 mainline kernel will support booting on Apple M3 devices, including iMac and MacBook products.
Cheap GPUApple M3
This means that users with Apple M3 devices will be able to run Linux on their hardware, potentially improving the utility of these devices for users who prefer or require Linux.
Try it
sudo apt update && sudo apt upgrade -y && sudo apt install linux-image-7.2Post-train autonomous vehicle models in closed-loop with NVIDIA AlpamayoNewAutoUse NVIDIA Alpamayo to bridge the gap between training and deployment for AV policies
TrainingNVIDIA GPU
NVIDIA Alpamayo helps in post-training autonomous vehicle models in a closed-loop, which is crucial for developing effective AV policies. This tool can be used to fine-tune and validate models before deployment.
Try it
nvidia-alpamayo --train-model --input-data <data>Optimize Kernel Tracing for Function ParametersNewAutoImprove kernel tracing by considering unused function parameters.
ToolingCPU only
Optimizing compilers may remove unused function parameters, which can affect kernel tracing or BPF subsystems. To address this, consider the implications of such optimizations on tracing functionality and adjust accordingly.
Try it
# Example of optimizing kernel tracing
# This is a placeholder command as the actual implementation depends on the specific kernel function and tracing setup.
# Please refer to the source article for detailed instructions.
echo 'Optimize kernel tracing for unused parameters'Use NVIDIA Vera CPU for agentic workloads in AI factoriesNewAutoLeverage NVIDIA Vera CPU to handle agentic workloads in AI factories.
Cheap GPUNVIDIA Vera CPU
NVIDIA Vera CPU sets a new standard for agentic workloads by enabling AI factories to preprocess and analyze large datasets more efficiently, leading to improved AI model training and scaling.
Try it
# Example command for running agentic workloads on NVIDIA Vera CPU
# This is a placeholder command and may vary based on actual usage
nvidia_vera_run --task <task_name> --data <data_path>Advance AI infrastructure with NVIDIA DOCA In-Silicon SecurityNewAutoUtilize NVIDIA DOCA for enhancing AI infrastructure with in-silicon security.
ToolingNVIDIA DOCA
NVIDIA DOCA introduces a new class of infrastructure for AI, transforming data into intelligence for autonomous AI agents with unprecedented security and efficiency, making it ideal for building AI factories.
Try it
# Example command for using NVIDIA DOCA
# This is a placeholder command and may vary based on actual usage
doca_command --option <option_value>Post-train AV models in closed-loop with NVIDIA AlpamayoNewAutoUse NVIDIA Alpamayo for post-training AV models to bridge the gap between training and deployment.
TrainingNVIDIA Alpamayo
NVIDIA Alpamayo is designed to help developers post-train autonomous vehicle models in a closed-loop system, which is crucial for refining AV policies and ensuring they perform well in real-world scenarios.
Try it
# Example command for post-training with NVIDIA Alpamayo
# This is a placeholder command and may vary based on actual usage
alpamayo_post_train --model <model_path> --data <data_path>Automate AI model documentation with NVIDIA MCG ToolkitNewAutoUse NVIDIA MCG Toolkit to streamline AI model documentation.
ToolingAny NVIDIA GPU
The NVIDIA MCG Toolkit automates AI model documentation, which is crucial for regulatory compliance and understanding model complexity.
Try it
nvidia-mcg-toolkit --helpAutomate AI model documentation with NVIDIA MCG ToolkitNewAutoUse NVIDIA MCG Toolkit to automate AI model documentation for regulatory compliance.
ToolingN/A
As AI models grow in complexity, regulatory scrutiny intensifies. The NVIDIA MCG Toolkit automates AI model documentation, helping teams comply with frameworks like California's AB-2013 and the EU AI Act.
Try it
N/AFast Startup for Inference Workloads on Kubernetes with NVIDIA DynamoNewAutoUse NVIDIA Dynamo for fast startup of inference workloads on Kubernetes.
InferenceAny NVIDIA GPU
NVIDIA Dynamo provides fast startup for inference workloads on Kubernetes, addressing the cold-start problem in production inference deployments where demand fluctuates over time.
Try it
kubectl apply -f nvidia-dynamo.yamlRun Step 3.7 Flash on NVIDIA GPUs for multimodal AINewAutoDeploy multimodal AI applications on NVIDIA GPUs using Step 3.7 Flash.
ToolingAny NVIDIA GPU
AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and text. Step 3.7 Flash on NVIDIA GPUs enables this.
Try it
nvidia-smiAutomate AI model documentation with NVIDIA MCG ToolkitNewAutoUse NVIDIA MCG Toolkit to automate AI model documentation for regulatory compliance.
ToolingCPU only
The NVIDIA MCG Toolkit helps automate AI model documentation, which is crucial for regulatory scrutiny under frameworks like California’s AB-2013 and the EU AI Act.
Try it
nvidia-mcg --helpUse NVIDIA Dynamo for fast inference workloads on KubernetesNewAutoAccelerate inference workloads on Kubernetes with NVIDIA Dynamo.
InferenceNVIDIA GPU
NVIDIA Dynamo provides fast startup for inference workloads on Kubernetes, addressing the cold-start problem in production inference deployments where demand fluctuates over time.
Try it
N/ALeverage CUDA Python 1.0 for unified GPU programmingNewAutoCUDA 13.3 introduces CUDA Python 1.0 for easier GPU programming
ToolingAny NVIDIA GPU
CUDA 13.3 includes the release of CUDA Python 1.0, which provides a unified GPU programming model. This allows developers to write Python code that can run on both CPUs and GPUs, simplifying the development of GPU-accelerated applications.
Try it
import cupy as cpUse NVIDIA Dynamo for fast startup of inference workloads on KubernetesNewAutoNVIDIA Dynamo reduces inference workload startup times on Kubernetes
InferenceAny NVIDIA GPU
NVIDIA Dynamo Snapshot is designed to address the cold-start problem in inference deployments by reducing the time it takes to spin up inference replicas. This can be particularly beneficial for fluctuating demand scenarios, allowing for more efficient scaling of inference services.
Try it
kubectl apply -f <your_inference_service>CUDA Python 1.0 released for unified GPU programmingNewAutoSimplify GPU programming with CUDA Python 1.0
ToolingNVIDIA GPU
CUDA 13.3 introduces CUDA Python 1.0, which provides a familiar Python-based interface for GPU programming, making it easier to develop and deploy GPU-accelerated applications across various domains.
Try it
import cupy as cpNVIDIA Dynamo Snapshot for fast inference workloads on KubernetesNewAutoOptimize inference workloads on Kubernetes for fluctuating demand
InferenceNVIDIA GPU
NVIDIA Dynamo Snapshot addresses the cold-start problem in production inference deployments by enabling fast startup for inference workloads on Kubernetes, allowing inference replicas to scale elastically and handle demand fluctuations efficiently.
Try it
kubectl apply -f <dynamo-snapshot-config>CUDA 13.3 Enhances GPU Development with Tile Programming and AutotuningNewAutoCUDA 13.3 introduces tile programming in C++, compiler autotuning, and Python updates.
ToolingAny NVIDIA GPU
NVIDIA CUDA 13.3 brings new capabilities and performance optimizations, including tile programming and compiler autotuning, benefiting developers across the CUDA ecosystem.
Try it
nvcc -tile my_kernel.cuDevelop High-Performance GPU Kernels with CUDA TileNewAutoNVIDIA CUDA Tile allows optimizing GPU kernels using tile-based programming in C++.
ToolingAny NVIDIA GPU
NVIDIA CUDA Tile programming enables developers to write highly optimized GPU kernels within existing C++ codebases, improving performance.
Try it
#include <cuda_tile.h>
__global__ void my_kernel() {
// Use tile-based programming constructs
}Use NVIDIA CompileIQ for auto-tuning compiler optionsNewAutoNVIDIA CompileIQ auto-tunes compiler options for optimal GPU performance.
ToolingAny NVIDIA GPU
NVIDIA CompileIQ tackles the problem of finding the best compiler options for a specific GPU, unlocking performance potential automatically.
Try it
nvcc -autotune optionDevelop GPU kernels with NVIDIA CUDA TileNewAutoNVIDIA CUDA Tile allows developing optimized GPU kernels using tile-based programming in C++.
ToolingAny NVIDIA GPU
NVIDIA CUDA Tile programming enables the development of highly optimized GPU kernels within existing C++ codebases, improving performance.
Try it
nvcc -tile my_kernel.cuUsing LLMs for reviewing kernel patchesNewAutoLarge language models are being used to review kernel patches.
ToolingCPU only
The kernel community has been exploring the use of large language models (LLMs) for reviewing kernel patches, which could potentially improve the efficiency and accuracy of the review process.
Try it
# Example command to use LLM for patch review (hypothetical)AI assistance in Linux kernel developmentNewAutoAI tools like GitHub Copilot and Claude Code are being used to generate or co-author patches for the Linux kernel.
ToolingCPU only
This indicates a growing trend in leveraging AI for software development, specifically in the context of kernel development, which can help improve efficiency and reduce the maintenance burden.
Try it
git apply < generated-patch-file.patchAI Assistance in Generating Linux Kernel PatchesNewAutoGitHub Copilot and Claude Code are being used to generate or co-author Linux kernel patches.
ToolingCPU only
This indicates an increasing trend in the use of AI for software development, particularly in the context of kernel development, which can help automate and speed up the process of fixing issues.
Try it
git apply < generated_patch_file.patchGet real-time visibility into GPU usage across Kubernetes clustersNewAutoMaximize AI infrastructure value with deep visibility into GPU utilization
ToolingAny GPU
NVIDIA's solution provides real-time visibility into GPU usage across Kubernetes clusters, which is essential for optimizing resource allocation and maximizing the value of AI infrastructure.
Try it
# Placeholder command, actual implementation depends on NVIDIA's monitoring toolsSynthesize realistic 3D medical images at scaleNewAutoUse NVIDIA's method to generate high-quality 3D medical imaging data for radiology AI
TrainingRTX 3090 24GB
NVIDIA's method allows for the synthesis of realistic 3D medical images at scale, addressing data scarcity and privacy issues in radiology AI. This can be crucial for training AI models on diverse and representative datasets.
Try it
# Placeholder command, actual implementation depends on NVIDIA's tools and frameworksDevelop web apps with local LLM inferenceNewAutoUse local inference to avoid costs and latency of metered AI APIs
InferenceCPU only
Canonical's approach to developing web apps with local LLM inference can help reduce costs and latency associated with metered AI APIs. This enables rapid iteration during development without incurring high costs.
Try it
llama-serve --model <model_path> --port <port>Develop web apps with local LLM inference to save costsNewAutoUse local LLM inference for web app development to avoid metered API costs and enable rapid iteration.
InferenceCPU only
Working with metered AI APIs can be costly and slow down development. By using local LLM inference, developers can iterate rapidly and avoid high costs associated with API calls. This approach is particularly beneficial for web app development where rapid iteration is crucial.
Try it
# Example Python code for local LLM inference
from transformers import pipeline
# Load the local LLM model
llm = pipeline('llm', model='local-model-name')
# Perform inference
result = llm("Your input text")
print(result)Cache Aware Scheduling to improve Linux kernel performanceNewAutoImproves performance by reducing cache misses
Cheap GPUCPU only
CONFIG_SCHED_CACHE has been merged into the mainline kernel, which should improve performance by reducing cache misses. This can be particularly beneficial for AI workloads running on CPU-only systems.
Try it
# CONFIG_SCHED_CACHE is enabled by default in Linux 7.2
# No specific command needed, just ensure your kernel is updatedProperly evaluate AI agents using agentic techniquesNewAutoDistinguish between evaluating AI models and AI agents
TrainingCPU only
Evaluating an AI model and evaluating an AI agent are related but answer fundamentally different questions. A model benchmark tests the capability of a model, whereas an AI agent evaluation focuses on how well the agent performs in a specific environment or task.
SourceMastering Agentic Techniques for AI Agent EvaluationNewAutoUnderstand the difference between evaluating AI models and AI agents
TrainingCPU only
Evaluating an AI model tests its capability, while evaluating an AI agent answers different questions. This distinction is crucial for developers to understand when assessing their AI systems.
SourceFine-tune NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video GenerationNewAutoUse LoRA/DoRA for fine-tuning NVIDIA Cosmos Predict 2.5 for robot video generation tasks
TrainingRTX 3090 24GB
This blog post details how to fine-tune NVIDIA's Cosmos Predict 2.5 model using LoRA/DoRA for robot video generation tasks. Fine-tuning allows the model to adapt to specific use cases, improving its performance on tasks like video generation for robotics.
Try it
model = AutoModelForCausalLM.from_pretrained('nvidia/cosmos-predict-2.5')
tokenizer = AutoTokenizer.from_pretrained('nvidia/cosmos-predict-2.5')
with torch.no_grad():
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logitsFine-tune NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video GenerationNewAutoUse LoRA/DoRA to fine-tune NVIDIA Cosmos Predict 2.5 for improved robot video generation
TrainingRTX 3090 24GB
In this blog post, NVIDIA demonstrates how to fine-tune their Cosmos Predict 2.5 model using LoRA/DoRA for generating robot videos. This approach can potentially improve the quality and accuracy of generated videos, which is crucial for applications in robotics and autonomous systems.
Try it
python fine_tune.py --model cosmos-predict-2.5 --strategy lora-doraUse AI fuzzing tools for uncovering Linux kernel bugsNewAutoAI fuzzing tools can help identify bugs in the Linux kernel
ToolingCPU only
Linux's second-in-command Greg Kroah-Hartman has been leveraging new AI fuzzing tools for uncovering Linux kernel bugs, highlighting the effectiveness of AI in identifying critical issues in system software.
Try it
# Example command to run AI fuzzing tool (hypothetical)Improve Linux GPU Drivers for Better Gaming ExperienceAutoValve is expanding their open-source Linux graphics driver team to enhance GPU drivers.
Cheap GPUCPU only
Valve has hired a leading Mesa developer from AMD to join their team, aiming to improve the Linux GPU drivers for a better gaming experience. This move signifies the importance of optimizing GPU drivers for better performance and compatibility on Linux systems.
Try it
sudo apt-get install mesa-utilsEnhancing Linux GPU Drivers for Better Gaming ExperienceAutoValve hires top Mesa developer from AMD to improve Linux GPU drivers.
Cheap GPURTX 3090 24GB
Valve continues to expand their open-source Linux graphics driver team, securing top talent to enhance the Linux GPU drivers for a better gaming experience, which can also benefit AI developers running GPU-intensive tasks.
Try it
sudo apt-get install mesa-utilsLinux Kernel Adds Documentation for Responsible AI UseAutoNew documentation in Linux 7.1 kernel focuses on responsible AI use for finding kernel bugs.
ToolingCPU only
This documentation provides guidelines on what qualifies as a security bug and how AI can be used responsibly to identify kernel bugs, which can help developers improve the security and reliability of their systems.
Try it
cat /usr/src/linux-7.1/Documentation/admin-guide/ai-bugs.rstDocument what qualifies as a security bug in AIAutoLinux 7.1 kernel adds documentation for responsible AI use in finding kernel bugs.
ToolingCPU only
This documentation helps developers understand the criteria for what constitutes a security bug when using AI to find kernel bugs, promoting responsible AI practices.
Try it
cat /usr/src/linux-7.1/Documentation/security/responsible-ai-use.rstSolving Agentic AI's Scale-Up Problem with NVIDIA Vera Rubin PlatformAutoNVIDIA Vera Rubin platform addresses the scale-up problem in agentic AI inference workloads.
Cheap GPUNVIDIA Vera Rubin
Agentic inference has fundamentally changed the runtime dynamics of inference workloads by introducing non-deterministic trajectories. NVIDIA Vera Rubin platform is designed to solve the scale-up problem in agentic AI, enabling efficient inference on large models.
Try it
# Example command to run agentic AI inference on NVIDIA Vera Rubin
nvidia-smi -i 0 --gpu=0 --compute-mode=exclusive_process --threads=1 --mig=1g.1g.1g.1g.1g.1g.1g.1gArm Mali G1 Pro support in PanVK and Panfrost driversAutoPanVK Vulkan driver and Panfrost Gallium3D driver now support Arm Mali G1-Pro GPU hardware.
Cheap GPUArm Mali G1-Pro
This support enables AI developers to utilize Arm Mali G1-Pro GPUs with open-source drivers, expanding the range of affordable hardware options for AI development.
Try it
git clone https://github.com/panfrost-driver/panfrost && cd panfrost && ./configure && make && sudo make installImproved support for older AMD GPUs on LinuxAutoValve's Linux open-source graphics driver team enhances aging AMD GCN 1.0/1.1 era graphics cards.
Cheap GPUOlder AMD GCN 1.0/1.1 GPUs
This improvement allows for better utilization of older AMD GPUs on Linux, potentially enabling AI developers to run models on more affordable hardware.
Try it
sudo apt-get install mesa-driverArm Mali G1 Pro support in open-source PanVK & Panfrost driversAutoPanVK Vulkan driver and Panfrost Gallium3D driver now support Arm Mali G1-Pro GPU hardware.
Cheap GPUArm Mali G1-Pro
This support enables AI development on devices with Arm Mali G1-Pro GPUs, which are typically found in lower-cost or embedded systems.
Try it
git clone https://github.com/panfrost-driver/panfrost && cd panfrost && ./configure && make && sudo make installLearn from Parameter Golf AI-assisted research techniquesAutoExplore AI-assisted machine learning research and model design
TrainingCPU only
Parameter Golf event gathered 1,000+ participants to explore AI-assisted research, coding agents, quantization, and novel model design under strict constraints.
Try it
# Placeholder for AI-assisted research commandsOptimize AI model serving by reducing pipeline frictionAutoImprove model serving efficiency by optimizing the pipeline
InferenceRTX 3090 24GB
Invest weeks in fine-tuning models and discover exporting to production is not smooth. Use NVIDIA's techniques to optimize the serving pipeline and reduce friction.
Try it
# NVIDIA's model optimization and serving tips
# Placeholder for specific commandsOptimize GPU fleet with NVIDIA Fleet IntelligenceAutoNVIDIA Fleet Intelligence provides real-time visibility and optimization for GPU fleets
ToolingNVIDIA GPUs
NVIDIA Fleet Intelligence can help maximize the utilization and efficiency of large GPU fleets. It provides insights into GPU usage and performance, enabling better resource allocation and optimization.
Try it
nvidia-fleet-intelligence --helpUtilize AWS for foundation model training and inferenceAutoAWS provides building blocks for training and inference of foundation models
TrainingAWS GPU instances
AWS offers various services and tools that can be used to train and deploy foundation models efficiently. These services can help manage the complexity of large-scale model training and inference.
Try it
aws s3 sync s3://my-bucket/path/to/model /path/to/local/modelUse NVIDIA Fleet Intelligence for GPU fleet optimizationAutoNVIDIA Fleet Intelligence provides real-time visibility and optimization for GPU fleets
ToolingNVIDIA GPUs
NVIDIA Fleet Intelligence offers tools for monitoring and optimizing large GPU fleets, enabling efficient resource utilization and performance.
Try it
nvidia-fleet-intelligence --query --updateImprove Linux per-core I/O performance with new patchesAutoJens Axboe is working on Linux patches to significantly improve per-core I/O performance.
ToolingCPU only
Following a presentation at the Linux storage, file-system, memory management and BPF summit, Jens Axboe was motivated to improve Linux I/O overhead compared to the Storage Performance Development Kit (SPDK), aiming for a 60% increase in per-core I/O performance.
Try it
# Apply the patches to the Linux kernel
git apply /path/to/axboe-patchesUtilize NVIDIA Dynamo for structured agentic exchangesAutoNVIDIA Dynamo supports multi-turn agentic harness for preserving structured interactions.
ToolingAny NVIDIA GPU
NVIDIA Dynamo introduces multi-turn agentic harness support, allowing assistant turns to interleave reasoning with tool calls, and user turns to return responses. This is beneficial for applications requiring structured interactions and can be implemented on any NVIDIA GPU.
Try it
nvidia_dynamo_agentic_exchangeNVIDIA Releases CUDA-Oxide for Rust-To-CUDA CompilationAutoEnables developers to use Rust for developing CUDA kernels for NVIDIA GPUs.
ToolingAny NVIDIA GPU
NVIDIA Labs project CUDA-Oxide 0.1 allows Rust to be used for developing CUDA kernels, potentially improving developer productivity and kernel performance.
Try it
cargo install cuda-oxideMonitor real-time performance and debug with NCCL Inspector and PrometheusAutoDistributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). NCCL Inspector and Prometheus can be used for real-time performance monitoring and faster debugging.
ToolingAny NVIDIA GPU
NCCL Inspector is a tool that can be used to monitor the performance of NCCL in real-time. It can help identify bottlenecks and issues in the communication between GPUs, leading to faster debugging and optimization of distributed deep learning workloads.
Try it
ncclInspector --log /path/to/logfileReduce VRAM usage and improve inference performance with NVIDIA Model OptimizerAutoModel quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs.
QuantizationGeForce RTX GPUs
NVIDIA Model Optimizer supports post-training quantization, which can reduce VRAM usage and improve inference performance on consumer devices. This can be particularly useful for deploying models on devices with limited resources.
Try it
mo --input_model <model>.onnx --output_model <quantized_model>.onnx --data_type <INT8/FP16>Monitor GPU-to-GPU communication with NCCL Inspector and PrometheusAutoUse NCCL Inspector and Prometheus for real-time performance monitoring and faster debugging in distributed deep learning
ToolingNVIDIA GPUs
Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). NCCL Inspector and Prometheus provide real-time performance monitoring and faster debugging.
Try it
ncclInspector --log <log_file>Reduce VRAM usage and improve inference performance with model quantizationAutoUse NVIDIA Model Optimizer for post-training quantization to optimize models for consumer devices
QuantizationNVIDIA GeForce RTX GPUs
Model quantization reduces VRAM usage and improves inference performance on consumer devices such as NVIDIA GeForce RTX GPUs.
Try it
mo --input_model <model> --output_model <quantized_model> --data_type <INT8/FP16>Speed up Unreal Engine NNE inference with NVIDIA TensorRTAutoUtilize NVIDIA TensorRT to accelerate neural network inference in Unreal Engine
InferenceRTX 3090 24GB
NVIDIA TensorRT is used to speed up neural network execution in Unreal Engine, which can be beneficial for applications requiring real-time inference, such as in gaming or computer graphics.
Try it
# Example command to integrate TensorRT with Unreal Engine NNE inference
# Please refer to NVIDIA's documentation for specific implementation details
# This is a placeholder and not a direct command
tensorrt_commandSpeed up Unreal Engine NNE inference with NVIDIA TensorRTAutoNVIDIA TensorRT optimizes neural network inference for RTX runtime in Unreal Engine
InferenceRTX 3090 24GB
NVIDIA's TensorRT can be used to accelerate neural network inference in Unreal Engine, particularly for RTX runtime. This can significantly boost image quality and performance in computer graphics applications, making it suitable for high-performance graphics tasks.
Try it
# Example: Using NVIDIA TensorRT with Unreal Engine NNE
# This is a conceptual representation and not actual code
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30
engine = builder.build_engine(network, config)Use multi-token prediction for faster inference in Gemma 4AutoMulti-token prediction can speed up inference by predicting multiple tokens at once
InferenceCPU only
Google's Gemma 4 uses multi-token prediction to accelerate inference. This approach predicts multiple tokens at once, which can significantly speed up inference times, especially for models that generate sequences of tokens.
Try it
# Example: Multi-token prediction in Gemma 4
# This is a conceptual representation and not actual code
multi_token_prediction = True
inference_speedup = perform_inference(model, multi_token_prediction)Speed up Unreal Engine NNE inference with NVIDIA TensorRTAutoUse NVIDIA TensorRT to accelerate neural network inference in Unreal Engine
InferenceRTX 3090 24GB
NVIDIA TensorRT is used to speed up neural network inference in Unreal Engine, which can be beneficial for applications requiring real-time image processing and graphics rendering.
Try it
# Example command to use TensorRT for inference
# This is a placeholder and actual command may vary based on specific use case
./inference_engine --model=model.trt --input=input_dataUtilize ROCm 7.2.3 for AMD GPU compute and AI stack enhancementsAutoROCm 7.2.3 offers minor improvements for AMD GPU compute and AI stack.
ToolingAMD GPUs
ROCm 7.2.3 provides minor updates to the open-source AMD GPU compute and AI stack, which can be beneficial for developers working on AI applications that leverage AMD GPUs.
Try it
module load rocm/7.2.3Utilize ROCm 7.2.3 for AMD GPU compute and AI stack enhancementsAutoThis release includes minor improvements to the open-source AMD GPU compute and AI stack.
ToolingAMD GPUs
ROCm 7.2.3 offers minor updates to the open-source AMD GPU compute and AI stack, which can be beneficial for developers working with AMD GPUs to enhance their AI applications.
Try it
# Example command to install ROCm 7.2.3
sudo apt install rocm-dkms rocm-utilsGCC 16 compiler delivers performance gains over GCC 15AutoGCC 16.1 compiler shows performance improvements over GCC 15
ToolingCPU only
The GCC 16.1 compiler has been released with new changes and performance gains over GCC 15, which can benefit AI development by improving compilation times and efficiency.
Try it
gcc --versionAI tooling influences an increase in kernel patchesAutoThe 7.1 kernel prepatch suggests AI tooling is contributing to more patches than usual.
ToolingN/A
The article mentions that the 7.1 kernel prepatch has more patches than usual, likely due to AI tooling. This implies that AI tooling is becoming more prevalent in kernel development, which could lead to more efficient and effective kernel patches.
Try it
N/ALeverage AMD GAIA for building local AI agentsAutoAMD GAIA simplifies building AI agents on your PC with local AI processing.
ToolingCPU only
AMD GAIA (Generative AI Is Awesome) is an open-source software that leverages the Lemonade SDK to facilitate the creation of AI agents on Windows and Linux PCs. This can be beneficial for developers looking to build AI applications that run locally without relying on cloud-based services.
Try it
git clone https://github.com/AMD-GAIA/gaia && cd gaia && ./build.shUse AMD's GAIA for local AI development on Windows and LinuxAutoAMD's GAIA simplifies building AI agents on PCs with local AI processing.
ToolingCPU only
AMD's GAIA (Generative AI Is Awesome) is an open-source software that leverages the Lemonade SDK to facilitate the creation of AI agents on PCs with local AI processing capabilities. This can be beneficial for developers looking to build AI applications without relying on cloud-based services, enhancing privacy and reducing latency.
Try it
git clone https://github.com/AMD-GAIA/GAIA.git && cd GAIA && ./build.shSpeed up Unreal Engine NNE inference with NVIDIA TensorRTAutoNVIDIA TensorRT can accelerate neural network inference in Unreal Engine
InferenceRTX 30 series
NVIDIA TensorRT is used to speed up neural network inference in Unreal Engine, which can be beneficial for applications that require real-time graphics processing. The blog post suggests that using TensorRT can lead to significant performance improvements for tasks such as image quality enhancement and performance optimization.
Try it
# Example command to integrate TensorRT with Unreal Engine NNE
# Please refer to NVIDIA's documentation for specific implementation details
# https://developer.nvidia.com/blog/speed-up-unreal-engine-nne-inference-with-nvidia-tensorrt-for-rtx-runtime/Automate GPU kernel translation with AI agentsAutoUse AI to translate GPU kernels from Python to other languages, such as Julia, to expand the reach of your GPU-accelerated code.
ToolingAny GPU
NVIDIA's blog post discusses the use of AI agents to automate the translation of GPU kernels from Python to Julia, showcasing the potential for AI to assist in code portability and expansion across different programming environments.
Try it
# Example command to translate cuTile Python to cuTile.jl
# This is a conceptual representation and may not be directly executable
aitool translate --input cuTile.py --output cuTile.jlUtilize NVIDIA CUDA Tile for tile-based GPU programmingAutoNVIDIA CUDA Tile (cuTile) enables writing GPU kernels in terms of tile-level operations for better performance.
ToolingAny NVIDIA GPU
cuTile is a tile-based programming model that helps developers write GPU kernels focusing on tile-level operations such as loads, stores, and computations. This can lead to better performance and efficiency when programming for GPUs.
Try it
# Example of using cuTile for GPU kernel development
# This is a conceptual representation and not actual code
from cutile import Tile
# Define tile dimensions
tile = Tile(16, 16)
# Perform tile-based operations
# ...Accelerate Page Migration for Better PerformanceAutoAMD engineers are working on patches for accelerating page migration in the Linux kernel.
Cheap GPUCPU only
This patch series, originally started by a NVIDIA engineer in early 2025, aims to improve system performance by accelerating page migration. AMD's involvement suggests that this optimization could benefit a wide range of systems, not just those with AMD hardware.
Try it
git apply amd_page_migration.patchAccelerate page migration for better performanceAutoAMD engineers are working on patches to accelerate page migration for improved performance.
Cheap GPUCPU only
This patch series, originally started by a NVIDIA engineer, is now being worked on by AMD to accelerate page migration, which can lead to better performance in Linux systems.
Try it
git apply accelerated-page-migration.patch