SemiAnalysis | Dylan Patel | 2026-03-31

Dissecting Nvidia Blackwell - Tensor Cores, PTX Instructions, SASS, Floorsweep, Yield


Source content

Nvidia’s Datacenter Blackwell GPU (SM100) represents one of the largest GPU microarchitecture changes in a generation, yet no detailed whitepaper exists. Until today, there has been no public microbenchmarking study of the datacenter Blackwell architecture at the level of PTX and SASS instructions, such as UMMA and TMA, with a focus on AI workloads.

After our in-depth Nvidia Tensor Core Evolution: From Volta To Blackwell article, SemiAnalysis spent months of engineering time tearing into the Blackwell architecture and measuring raw PTX instruction performance to establish hard practical performance upper bounds and compare them with the theoretical peaks. We do this to discover unit- and instruction-level hardware throughput and latency limits, providing a useful characterization from an ML systems and kernel development perspective. We focus on deep learning workload configurations, such as benchmarking the asynchronous memory copy setups used in the popular deep learning library FlashInfer.

We open sourced our Blackwell micro-architecture-level benchmarking repo here. Please drop a star if you find it useful.

Acknowledgement

We thank Nebius and Verda for providing B200 nodes for microbenchmarking. Their B200 nodes have the correct hardware counters enabled, which makes NCU profiling possible. For users on cloud providers that don’t have NCU enabled, here is a workaround suggested by GPU Mode’s Mark Saroufim. We would also like to thank the authors of Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis and tcgen05 for dummies, whose work we based our code upon.

F…