Finding Miscompiles for Fun, Not Profit
Update June 1: The day after we published, Anthropic released Opus 4.8 and “ultracode” mode in Claude Code. Our preliminary experiments indicate that together these are significantly better at filtering out low-severity bugs, and that the cost per medium-to-high severity bug found is maybe 1/5 (with very large error bars) that of the workflow described in this article.

I’ve worked on compilers for ML for the last decade across Google, Waymo, and OpenAI. This includes CUDA support in clang, XLA:GPU, Triton, and OpenAI’s custom hardware. I’ve seen stuff. But over the past week or so I had one of the most unsettling experiences of my career: In one afternoon, I spent more than $10,000 running AI agents over compiler code, finding hundreds of plausible bugs in LLVM, including many miscompiles and at least one that’s Quite Serious. This is the story of how I got here and where we might be going.

In January 2026, I decided to try to find some bugs in LLVM (the compiler behind clang, rustc, and AMD’s GPU compiler, among others), as a personal project. Codex and I collaboratively wrote a fuzzer. The basic idea is to generate a random program, run it through part of the compiler, and then check that the resulting program after compilation does the same thing as the original program (usually just by running the two programs). I spent a few weeks on it, and I found and fixed five bugs in instcombine, LLVM’s peephole optimization pass. After that, my fuzzer started taking longer to find bugs, and I lost interest.

Fast forward to mid-May 2026. I joined SemiAnalysis as a contractor, and I d…