AKO: Agentic Kernel Optimization
SOL Score comparison on 13 kernels from SOL-ExecBench.
See Experiment Results for details.
Overview
AKO is not a new agent or model — it is a harness (optimization environment) for existing coding agents such as Claude Code. It places the agent into a well-structured environment where the evaluation criteria, benchmarking tools, profiling interfaces, and optimization trajectory are all clearly defined and readily accessible.
The agent retains full autonomy: it decides when to read code, rewrite kernels, run benchmarks, invoke the profiler, search the web, or switch languages entirely. AKO simply provides the right environment for it to do so effectively.
Two harnesses are available: AKO4ALL for general-purpose kernel optimization with maximum flexibility, and AKO4FIB for standardized optimization against flashinfer-bench operators.
Motivation
AKO grew out of our experience in the NVIDIA Track | MLSys 2026 FlashInfer AI Kernel Generation Contest. During the competition, we found that existing approaches — search-based methods and fixed-workflow systems that treat the model as a one-shot generation black box — performed far worse than simply letting Claude Code optimize kernels directly.
Our key insight is that prior methods either heavily constrain model capabilities, or hand decision-making to external search algorithms while keeping the model under-informed — without a complete optimization trajectory, and without the ability to freely inspect error messages and profiling data. As models evolve rapidly, approaches that restrict their capabilities become increasingly counter-productive. Stronger models deserve fewer constraints, not more.
Based on this, we chose to build a better harness for existing coding agents rather than building yet another agent or model.
Tools
AKO ships two harnesses that share the same philosophy — empower the agent, don’t constrain it — but target different use cases.
AKO4ALL
A completely open, minimal harness. You provide just a kernel — and optionally a reference implementation, a custom benchmark script, context documents, or hints. The agent has maximum freedom to optimize however it sees fit.
- Built-in KernelBench evaluator supporting Triton, CUDA, C++, TileLang, CuTe DSL, HIP, and Python
- Or bring your own benchmark — any evaluation script works
- Fully open architecture: the agent can freely switch languages, restructure code, or change strategy
- Automatic trajectory recording and git integration
Trade-off: maximum flexibility means optimization stability depends more on the model’s own capability. Best for ad-hoc kernels, custom workloads, rapid prototyping, and non-standard evaluation setups.
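Since any evaluation script works in bring-your-own-benchmark mode, a custom benchmark can be as small as a script that checks correctness against a reference and prints a speedup. The sketch below illustrates that shape in pure Python so it runs anywhere; the function names and output format are illustrative assumptions, not part of the AKO API, and a real script would launch an actual Triton/CUDA kernel.

```python
"""Minimal custom-benchmark sketch for a bring-your-own-benchmark setup.
All names here are illustrative, not part of the AKO API; a real script
would time a GPU kernel instead of a pure-Python function."""
import time
import statistics
from itertools import accumulate


def reference(xs):
    # Naive reference implementation: prefix sums.
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out


def candidate(xs):
    # Candidate under test: same result, different implementation.
    return list(accumulate(xs))


def bench(fn, xs, iters=50):
    # Median wall-clock time over several runs, after one warmup call.
    fn(xs)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(xs)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)


def main():
    xs = list(range(10_000))
    # Correctness gate first: a fast-but-wrong kernel scores zero.
    assert candidate(xs) == reference(xs), "correctness check failed"
    ref_t, cand_t = bench(reference, xs), bench(candidate, xs)
    speedup = ref_t / cand_t
    # The harness only needs a parseable result on stdout.
    print(f"speedup: {speedup:.2f}x")
    return speedup


if __name__ == "__main__":
    main()
```

The correctness check before timing matters: it prevents the agent from being rewarded for a fast kernel that returns wrong results.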
AKO4FIB
Built on the flashinfer-bench SDK. Uses its data format (flashinfer-trace), benchmark infrastructure, and profiling tools.
spawn.py creates isolated, self-contained optimization environments per operator.
- Supports Triton, CUDA, C++, TileLang, CuTe DSL, and Python
- Structured operator definitions with axes, workloads, and reference implementations
- Standardized per-workload scoring with automatic baseline caching
- Integrated NCU profiling with structured output
- Trajectory recording with full results metadata
- Local GPU and Modal cloud backends
Trade-off: more constrained (flashinfer-bench format requirements), but provides a more stable environment, more accurate evaluation, and a direct path to production integration. Best for flashinfer-bench operators, the MLSys contest, attention/sparse-attention kernels, and reproducible evaluations.
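The actual scoring logic lives in the flashinfer-bench SDK; the sketch below only illustrates the general shape of per-workload scoring with baseline caching. The `Scorer` class, the cache layout, and the geometric-mean aggregation are all assumptions made for illustration, not the SDK's API.

```python
"""Illustrative sketch of per-workload scoring with baseline caching.
The real scoring is defined by the flashinfer-bench SDK; the class name,
cache layout, and geometric-mean aggregation here are assumptions."""
import math


class Scorer:
    def __init__(self, baseline_fn, timer):
        self.baseline_fn = baseline_fn
        self.timer = timer            # timer(fn, workload) -> latency
        self._baseline_cache = {}     # workload key -> baseline latency

    def _baseline_latency(self, workload):
        # Measure the reference implementation once per workload,
        # then reuse the cached latency on every subsequent scoring run.
        key = tuple(sorted(workload.items()))
        if key not in self._baseline_cache:
            self._baseline_cache[key] = self.timer(self.baseline_fn, workload)
        return self._baseline_cache[key]

    def score(self, candidate_fn, workloads):
        # Per-workload speedup vs the cached baseline, aggregated by
        # geometric mean so no single workload dominates the score.
        speedups = []
        for w in workloads:
            base = self._baseline_latency(w)
            cand = self.timer(candidate_fn, w)
            speedups.append(base / cand)
        return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

Caching the baseline latencies is what keeps iterated optimization cheap: only the candidate is re-timed on each iteration, while the reference is measured once per workload.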
Experiment Results
We evaluated AKO4ALL on 13 kernels from SOL-ExecBench (Level 1, Level 2, and flashinfer-bench). 10 of the 13 kernels outperform the Scoring Baseline (NVIDIA's multi-agent-optimized kernels).
Configuration
- Coding agent: Claude Code (Opus 4.6)
- Development hardware: NVIDIA A100-SXM4-80GB
- Evaluation: SOL-ExecBench online platform (B200 hardware)
- Time per kernel: 1–2 hours
- We did not have direct access to B200 with NCU profiling privileges, creating an inherent disadvantage compared to local optimization on the target hardware.
Observations
- Optimization avoidance — Claude Code occasionally avoids deep kernel-level optimization, defaulting to surface-level tuning such as adjusting launch configurations instead of rewriting compute logic.
- Language preference — Claude Code strongly favors Triton and CUDA, and shows limited proficiency with other kernel DSLs (e.g., TileLang, CuTe DSL).
- Cross-architecture regression — CUDA kernels iteratively optimized on A100 often exhibit performance degradation when evaluated on B200, likely due to architectural differences between the development and evaluation hardware.
Future Work
- Optimize on B200 with NCU privileges for evaluation-aligned iteration
- Run more optimization iterations per kernel
- Incorporate expert guidance during the optimization process
Acknowledgments
We would like to thank the following open-source projects that inspired and supported the development of AKO:
- KernelBench — for providing the benchmark and evaluation format used by AKO4ALL’s built-in evaluator.
- FlashInfer — for the LLM inference kernel library and the flashinfer-bench benchmark infrastructure on which AKO4FIB is built.
- autoresearch and autokernel — AKO’s design was inspired by their work on autonomous optimization loops.