AKO: Agentic Kernel Optimization
SOL Score comparison on 13 kernels from SOL-ExecBench.
See Experiment Results for details.
Overview
AKO is not a new agent or model — it is a harness (optimization environment) for existing coding agents such as Claude Code. It places the agent into a well-structured environment where the evaluation criteria, benchmarking tools, profiling interfaces, and optimization trajectory are all clearly defined and readily accessible.
The agent retains full autonomy: it decides when to read code, rewrite kernels, run benchmarks, invoke the profiler, search the web, or switch languages entirely. AKO simply provides the right environment for it to do so effectively.
Two harnesses are available: AKO4ALL for general-purpose kernel optimization with maximum flexibility, and AKO4FIB for standardized optimization against flashinfer-bench operators.
Motivation
AKO grew out of our experience in the NVIDIA Track | MLSys 2026 FlashInfer AI Kernel Generation Contest. During the competition, we found that existing approaches — search-based methods and fixed-workflow systems that treat the model as a one-shot generation black box — performed far worse than simply letting Claude Code optimize kernels directly.
Our key insight is that prior methods either heavily constrain model capabilities, or hand decision-making to external search algorithms while keeping the model under-informed — without a complete optimization trajectory, and without the ability to freely inspect error messages and profiling data. As models evolve rapidly, approaches that restrict their capabilities become increasingly counter-productive. Stronger models deserve fewer constraints, not more.
Based on this, we chose to build a better harness for existing coding agents rather than building yet another agent or model.
Tools
AKO ships two harnesses that share the same philosophy — empower the agent, don’t constrain it — but target different use cases.
AKO4ALL
A completely open, minimal harness. You provide just a kernel — and optionally a reference implementation, a custom benchmark script, context documents, or hints. The agent has maximum freedom to optimize however it sees fit.
- Built-in KernelBench evaluator supporting Triton, CUDA, C++, TileLang, CuTe DSL, HIP, and Python
- Or bring your own benchmark — any evaluation script works
- Fully open architecture: the agent can freely switch languages, restructure code, or change strategy
- Automatic trajectory recording and git integration
Trade-off: maximum flexibility means optimization stability depends more on the model’s own capability. Best for ad-hoc kernels, custom workloads, rapid prototyping, and non-standard evaluation setups.
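Since any evaluation script works in bring-your-own-benchmark mode, a custom benchmark can be as small as a script that checks correctness against a reference and prints a speedup. The sketch below illustrates that shape in pure Python so it runs anywhere; the function names and output format are illustrative assumptions, not part of the AKO API, and a real script would launch an actual Triton/CUDA kernel.

```python
"""Minimal custom-benchmark sketch for a bring-your-own-benchmark setup.
All names here are illustrative, not part of the AKO API; a real script
would time a GPU kernel instead of a pure-Python function."""
import time
import statistics
from itertools import accumulate


def reference(xs):
    # Naive reference implementation: prefix sums.
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out


def candidate(xs):
    # Candidate under test: same result, different implementation.
    return list(accumulate(xs))


def bench(fn, xs, iters=50):
    # Median wall-clock time over several runs, after one warmup call.
    fn(xs)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(xs)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)


def main():
    xs = list(range(10_000))
    # Correctness gate first: a fast-but-wrong kernel scores zero.
    assert candidate(xs) == reference(xs), "correctness check failed"
    ref_t, cand_t = bench(reference, xs), bench(candidate, xs)
    speedup = ref_t / cand_t
    # The harness only needs a parseable result on stdout.
    print(f"speedup: {speedup:.2f}x")
    return speedup


if __name__ == "__main__":
    main()
```

The correctness check before timing matters: it prevents the agent from being rewarded for a fast kernel that returns wrong results.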
AKO4FIB
Built on the flashinfer-bench SDK. Uses its data format (flashinfer-trace), benchmark infrastructure, and profiling tools.
spawn.py creates isolated, self-contained optimization environments per operator.
- Supports Triton, CUDA, C++, TileLang, CuTe DSL, and Python
- Structured operator definitions with axes, workloads, and reference implementations
- Standardized per-workload scoring with automatic baseline caching
- Integrated NCU profiling with structured output
- Trajectory recording with full results metadata
- Local GPU and Modal cloud backends
Trade-off: more constrained (flashinfer-bench format requirements), but provides a more stable environment, more accurate evaluation, and a direct path to production integration. Best for flashinfer-bench operators, the MLSys contest, attention/sparse-attention kernels, and reproducible evaluations.
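The actual scoring logic lives in the flashinfer-bench SDK; the sketch below only illustrates the general shape of per-workload scoring with baseline caching. The `Scorer` class, the cache layout, and the geometric-mean aggregation are all assumptions made for illustration, not the SDK's API.

```python
"""Illustrative sketch of per-workload scoring with baseline caching.
The real scoring is defined by the flashinfer-bench SDK; the class name,
cache layout, and geometric-mean aggregation here are assumptions."""
import math


class Scorer:
    def __init__(self, baseline_fn, timer):
        self.baseline_fn = baseline_fn
        self.timer = timer            # timer(fn, workload) -> latency
        self._baseline_cache = {}     # workload key -> baseline latency

    def _baseline_latency(self, workload):
        # Measure the reference implementation once per workload,
        # then reuse the cached latency on every subsequent scoring run.
        key = tuple(sorted(workload.items()))
        if key not in self._baseline_cache:
            self._baseline_cache[key] = self.timer(self.baseline_fn, workload)
        return self._baseline_cache[key]

    def score(self, candidate_fn, workloads):
        # Per-workload speedup vs the cached baseline, aggregated by
        # geometric mean so no single workload dominates the score.
        speedups = []
        for w in workloads:
            base = self._baseline_latency(w)
            cand = self.timer(candidate_fn, w)
            speedups.append(base / cand)
        return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

Caching the baseline latencies is what keeps iterated optimization cheap: only the candidate is re-timed on each iteration, while the reference is measured once per workload.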
Experiment Results
We evaluated AKO4ALL on 13 kernels from SOL-ExecBench (Level 1, Level 2, and flashinfer-bench). 10 of the 13 kernels outperform the Scoring Baseline (NVIDIA's multi-agent-optimized kernels).
Configuration
- Coding agent: Claude Code (Opus 4.6)
- Development hardware: NVIDIA A100-SXM4-80GB
- Evaluation: SOL-ExecBench online platform (B200 hardware)
- Time per kernel: 1–2 hours
- We did not have direct access to B200 with NCU profiling privileges, creating an inherent disadvantage compared to local optimization on the target hardware.
Observations
- Optimization avoidance — Claude Code occasionally avoids deep kernel-level optimization, defaulting to surface-level tuning such as adjusting launch configurations instead of rewriting compute logic.
- Language preference — Claude Code strongly favors Triton and CUDA, and shows limited proficiency with other kernel DSLs (e.g., TileLang, CuTe DSL).
- Cross-architecture regression — CUDA kernels iteratively optimized on A100 often exhibit performance degradation when evaluated on B200, likely due to architectural differences between the development and evaluation hardware.
Future Work
- Optimize on B200 with NCU privileges for evaluation-aligned iteration
- Run more optimization iterations per kernel
- Incorporate expert guidance during the optimization process
Acknowledgments
We would like to thank the following open-source projects that inspired and supported the development of AKO:
- KernelBench — for providing the benchmark and evaluation format used by AKO4ALL’s built-in evaluator.
- FlashInfer — for the LLM inference kernel library and the flashinfer-bench benchmark infrastructure on which AKO4FIB is built.
- autoresearch and autokernel — AKO’s design was inspired by their work on autonomous optimization loops.