Overview
A benchmark run evaluates every definition × solution × workload combination. For each combination it validates correctness against the reference implementation and measures kernel performance, producing a Trace with the results.

Quick Start
CLI
Python API
run_all returns a new TraceSet that contains all the definitions, solutions, and workloads from the input, plus the newly generated traces from this run.
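The run_all contract described above can be illustrated with a minimal stand-in. The real TraceSet and run_all live in flashinfer_bench and carry more fields and parameters; the names and shapes below only mirror the description in this section.

```python
from dataclasses import dataclass, field

@dataclass
class TraceSet:
    # Stand-in mirroring the documented contents of a TraceSet.
    definitions: list = field(default_factory=list)
    solutions: list = field(default_factory=list)
    workloads: list = field(default_factory=list)
    traces: list = field(default_factory=list)

def run_all(ts: TraceSet) -> TraceSet:
    # Stand-in: evaluate every definition x solution x workload
    # combination and return a NEW TraceSet containing the input
    # definitions, solutions, and workloads plus the generated traces.
    new_traces = [
        (d, s, w)
        for d in ts.definitions
        for s in ts.solutions
        for w in ts.workloads
    ]
    return TraceSet(ts.definitions, ts.solutions, ts.workloads,
                    ts.traces + new_traces)
```

Note that the input TraceSet is left untouched; the run produces a fresh object.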
Benchmark Config
BenchmarkConfig is a Pydantic model that controls every aspect of a benchmark run. You can configure it directly in Python or load it from a YAML file.
Loading Configuration
The FlashInfer-Bench package bundles a default eval_config.yaml that sets sensible baselines for known op types.
You can provide a custom configuration via the CLI, which replaces the bundled defaults (CLI flags such as --rtol or --iterations are then applied as overrides on top of the YAML):
Or via the Python API:
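A minimal sketch of configuring in Python, using a dataclass stand-in whose fields mirror the tables below (the real BenchmarkConfig is a Pydantic model in flashinfer_bench, so field validation and defaults may differ in detail):

```python
from dataclasses import dataclass, field

# Stand-in mirroring a few documented BenchmarkConfig fields;
# the real class is a Pydantic model exported by flashinfer_bench.
@dataclass
class BenchmarkConfig:
    use_isolated_runner: bool = False
    timeout_seconds: int = 300
    warmup_runs: int = 10
    iterations: int = 50
    num_trials: int = 3
    rtol: float = 1e-2
    atol: float = 1e-2
    extra: dict = field(default_factory=dict)

# Override only the fields you care about; the rest keep their defaults.
cfg = BenchmarkConfig(iterations=100, rtol=1e-3)
```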
Configuration Structure
The configuration is divided into system-level fields (which apply to the runner) and eval config fields (which are resolved per-definition and passed to evaluators).
System Fields
These fields control the benchmarking engine and runner behavior.

| Field | Type | Default | Description |
|---|---|---|---|
| use_isolated_runner | bool | False | Use the isolated (subprocess) runner instead of the persistent runner |
| definitions | list[str] | None | Filter to specific definition names |
| solutions | list[str] | None | Filter to specific solution names |
| timeout_seconds | int | 300 | Per-workload timeout in seconds |
| profile_baseline | bool | True | Profile the reference implementation |
Eval Config Fields
These fields control correctness validation and performance measurement. You can set them at the top level (as global defaults), inside op_type_config, or inside definition_config.
| Field | Type | Global Default | Description |
|---|---|---|---|
| warmup_runs | int | 10 | Warmup iterations before timing |
| iterations | int | 50 | Timed iterations per trial |
| num_trials | int | 3 | Number of independent trials |
| rtol | float | 1e-2 | Relative tolerance for correctness |
| atol | float | 1e-2 | Absolute tolerance for correctness |
| required_matched_ratio | float | None | Minimum ratio of element-wise matches (used by MoE, lowbit) |
| extra | dict | {} | Open-ended dictionary for evaluator-specific parameters |
The extra dictionary is used to pass specialized parameters to specific evaluators. See the Sampling Evaluator below for an example.
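Putting the two groups of fields together, a custom configuration file might look like the following. This is a hypothetical sketch: the field names come from the tables above, but the exact nesting under op_type_config and definition_config may differ from the bundled eval_config.yaml, and my_definition_name is a placeholder.

```yaml
# System fields (apply to the runner)
use_isolated_runner: false
timeout_seconds: 300

# Top-level eval defaults
warmup_runs: 10
iterations: 50
rtol: 1.0e-2
atol: 1.0e-2

# Per-op-type overrides
op_type_config:
  sampling:
    extra:
      sampling_validation_trials: 100
      sampling_tvd_threshold: 0.2

# Per-definition overrides (highest priority)
definition_config:
  my_definition_name:   # hypothetical definition name
    rtol: 1.0e-3
```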
Resolution Order
The final evaluation configuration for a definition is resolved from highest to lowest priority as follows. For the extra dict, layers are merged via dict.update() instead of direct replacement.
- Per-definition config
- Per-op-type config
- Top-level global defaults
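The layering above can be sketched in plain Python. This is an illustration of the documented behavior, not the library's actual implementation; the helper name and dict-based layers are assumptions.

```python
def resolve_eval_config(global_defaults, op_type_cfg, definition_cfg):
    """Sketch of the documented resolution order: higher-priority layers
    override lower ones, except `extra`, which is merged via dict.update()."""
    resolved = {}
    extra = {}
    # Apply the lowest-priority layer first so later layers win.
    for layer in (global_defaults, op_type_cfg, definition_cfg):
        layer = dict(layer)
        extra.update(layer.pop("extra", {}))  # merge, don't replace
        resolved.update(layer)                # replace scalar fields
    resolved["extra"] = extra
    return resolved

cfg = resolve_eval_config(
    {"rtol": 1e-2, "iterations": 50, "extra": {"a": 1}},
    {"rtol": 1e-3, "extra": {"b": 2}},
    {"iterations": 100, "extra": {"a": 9}},
)
# rtol comes from the op-type layer, iterations from the definition
# layer, and extra is the union of all three layers.
```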
Runners
| Runner | Flag | Description |
|---|---|---|
| Persistent | (default) | Keeps a long-lived worker process per GPU. Lower overhead for many workloads. |
| Isolated | --use-isolated-runner | Spawns a new subprocess per workload. Better fault isolation. |
Evaluators
Different op types use specialized evaluators:

| Evaluator | Op Types | Notes |
|---|---|---|
| Default | gemm, rmsnorm, rope, gqa, mla, gdn | Element-wise tolerance check (rtol/atol) |
| Sampling | sampling | Statistical validation via TVD over multiple trials |
| Lowbit | lowbit | Element-wise with required_matched_ratio |
| DSA | dsa-paged | Specialized sparse attention validation |
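The default element-wise check and the lowbit matched-ratio check can be sketched as follows. The comparison formula matches the standard numpy.isclose / torch.allclose form; the helper names here are illustrative, not the library's own.

```python
def allclose(actual, expected, rtol=1e-2, atol=1e-2):
    # Element-wise tolerance test: |a - e| <= atol + rtol * |e|.
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )

def matched_ratio(actual, expected, rtol=1e-2, atol=1e-2):
    # Fraction of elements within tolerance; the lowbit evaluator
    # passes a solution when this meets required_matched_ratio.
    matches = sum(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )
    return matches / len(expected)
```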
SamplingEvaluator uses statistical validation via Total Variation Distance (TVD) over multiple trials. It reads its parameters directly from the extra dictionary in the configuration:
| Key | Default | Description |
|---|---|---|
| sampling_validation_trials | 100 | Number of sampling rounds for TVD validation |
| sampling_tvd_threshold | 0.2 | Maximum total variation distance to pass |
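Total variation distance between two empirical sample distributions is 0.5 × Σ |p(x) − q(x)|; a solution passes when the TVD against the reference sampler stays below sampling_tvd_threshold. A self-contained sketch of the metric (not the evaluator's actual code):

```python
from collections import Counter

def total_variation_distance(samples_a, samples_b):
    """TVD between two empirical distributions: 0.5 * sum |p(x) - q(x)|."""
    n_a, n_b = len(samples_a), len(samples_b)
    p, q = Counter(samples_a), Counter(samples_b)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[x] / n_a - q[x] / n_b) for x in support)
```

Identical empirical distributions give a TVD of 0; fully disjoint ones give 1.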
Custom evaluators
To add a custom evaluator:

- Subclass Evaluator (in flashinfer_bench/bench/evaluators/evaluator.py) and implement:
  - can_evaluate(definition): return True for definitions this evaluator handles
  - build_baseline(definition, workload, cfg, device): build reference outputs
  - check_correctness(definition, sol_runnable, inputs, ref_outputs, cfg, ...): validate solution correctness
  - eval_performance(definition, sol_runnable, inputs, ref_mean_latency_ms, cfg, ...): measure performance
- Register it in flashinfer_bench/bench/evaluators/registry.py by appending to the _EVALUATORS list. The first evaluator whose can_evaluate returns True is used; if none match, DefaultEvaluator is used.
- Use the extra dict in the YAML config to pass evaluator-specific parameters (see Eval Config Fields above). Read them in your evaluator via cfg.extra.get("my_param", default_value).
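A skeleton of the steps above, using an inline stand-in for the real base class in flashinfer_bench/bench/evaluators/evaluator.py. The method signatures mirror the list in this section but may differ in detail; "my_op", "my_param", and the dict-shaped cfg are placeholders for illustration only.

```python
class Evaluator:
    # Stand-in for the real base class; the actual one may have
    # additional parameters and helper methods.
    def can_evaluate(self, definition): raise NotImplementedError
    def build_baseline(self, definition, workload, cfg, device): raise NotImplementedError
    def check_correctness(self, definition, sol_runnable, inputs, ref_outputs, cfg): raise NotImplementedError
    def eval_performance(self, definition, sol_runnable, inputs, ref_mean_latency_ms, cfg): raise NotImplementedError

class MyEvaluator(Evaluator):
    def can_evaluate(self, definition):
        # Claim only the definitions this evaluator knows how to handle.
        return getattr(definition, "op_type", None) == "my_op"  # hypothetical op type

    def build_baseline(self, definition, workload, cfg, device):
        return {"out": [0.0]}  # placeholder reference outputs

    def check_correctness(self, definition, sol_runnable, inputs, ref_outputs, cfg):
        # Evaluator-specific knobs come from the extra dict
        # (cfg is a plain dict in this sketch; the real cfg exposes cfg.extra).
        threshold = cfg.get("extra", {}).get("my_param", 1e-2)  # hypothetical key
        out = sol_runnable(inputs)
        return all(abs(a - b) <= threshold
                   for a, b in zip(out["out"], ref_outputs["out"]))

    def eval_performance(self, definition, sol_runnable, inputs, ref_mean_latency_ms, cfg):
        # Placeholder: a real implementation would time sol_runnable.
        return {"mean_latency_ms": ref_mean_latency_ms}
```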

