Overview

A benchmark run evaluates every definition × solution × workload combination. For each combination it validates correctness against the reference implementation and measures kernel performance, producing a Trace with the results.

Quick Start

CLI

flashinfer-bench run --local /path/to/flashinfer-trace

Python API

from flashinfer_bench.bench import Benchmark, BenchmarkConfig
from flashinfer_bench.data import TraceSet

trace_set = TraceSet.from_path("/path/to/flashinfer-trace")
config = BenchmarkConfig.default()
benchmark = Benchmark(trace_set, config)
result_trace_set = benchmark.run_all(save_results=True)

run_all returns a new TraceSet that contains all the definitions, solutions, and workloads from the input, plus the traces newly generated by this run.

Benchmark Config

BenchmarkConfig is a Pydantic model that controls every aspect of a benchmark run. You can configure it directly in Python or load it from a YAML file.

Loading Configuration

The FlashInfer-Bench package bundles a default eval_config.yaml that sets sensible baselines for known op types. You can provide a custom configuration via the CLI (which replaces the bundled defaults):
flashinfer-bench run --local /path/to/flashinfer-trace --config my_config.yaml
CLI flags like --rtol or --iterations are applied as overrides on top of the YAML.

Or via the Python API:
# 1. Load the bundled eval_config.yaml + overrides (Default)
config = BenchmarkConfig.default(timeout_seconds=600)

# 2. Load a custom YAML file + overrides (replaces bundled defaults)
config = BenchmarkConfig.from_yaml("my_config.yaml", iterations=200)

# 3. Direct construction without loading any YAML
config = BenchmarkConfig(warmup_runs=5)

Configuration Structure

The configuration is divided into system-level fields (which apply to the runner) and eval config fields (which are resolved per-definition and passed to evaluators). Here is how the structure looks in Python and YAML:
BenchmarkConfig(
    # System-level fields
    use_isolated_runner=False,
    timeout_seconds=300,

    # Global default eval config fields
    warmup_runs=10,
    iterations=50,

    # Per-op-type overrides
    op_type_config={
        "moe": EvalConfig(required_matched_ratio=0.95),
        "sampling": EvalConfig(extra={"sampling_tvd_threshold": 0.2})
    },

    # Per-definition overrides (highest priority)
    definition_config={
        "rmsnorm_h4096": EvalConfig(rtol=0.001)
    }
)
The equivalent YAML:

# Top-level: system fields and global eval config defaults
use_isolated_runner: false
timeout_seconds: 300
warmup_runs: 10
iterations: 50

# Per-op-type overrides
op_type_config:
  moe:
    required_matched_ratio: 0.95
  sampling:
    extra:
      sampling_tvd_threshold: 0.2

# Per-definition overrides
definition_config:
  rmsnorm_h4096:
    rtol: 0.001

System Fields

These fields control the benchmarking engine and runner behavior.
| Field | Type | Default | Description |
|---|---|---|---|
| use_isolated_runner | bool | False | Use isolated (subprocess) runner instead of persistent runner |
| definitions | list[str] | None | Filter to specific definition names |
| solutions | list[str] | None | Filter to specific solution names |
| timeout_seconds | int | 300 | Per-workload timeout in seconds |
| profile_baseline | bool | True | Profile the reference implementation |

Eval Config Fields

These fields control correctness validation and performance measurement. You can set them at the top level (as global defaults), inside op_type_config, or inside definition_config.
| Field | Type | Global Default | Description |
|---|---|---|---|
| warmup_runs | int | 10 | Warmup iterations before timing |
| iterations | int | 50 | Timed iterations per trial |
| num_trials | int | 3 | Number of independent trials |
| rtol | float | 1e-2 | Relative tolerance for correctness |
| atol | float | 1e-2 | Absolute tolerance for correctness |
| required_matched_ratio | float | None | Minimum ratio of element-wise matches (used by MoE, lowbit) |
| extra | dict | {} | Open-ended dictionary for evaluator-specific parameters |
The extra dictionary is used to pass specialized parameters to specific evaluators. See the Sampling Evaluator below for an example.

Resolution Order

The final evaluation configuration for a definition is resolved from highest to lowest priority as follows. For the extra dict, layers are merged via dict.update() instead of direct replacement.
  1. Per-definition config
  2. Per-op-type config
  3. Top-level global defaults
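The layering can be sketched in plain Python. This is a simplified illustration of the rule described above, not the library's actual implementation; field names follow the tables in this document:

```python
# Simplified sketch of per-definition eval-config resolution:
# global defaults < per-op-type overrides < per-definition overrides,
# with `extra` merged via dict.update() rather than replaced wholesale.

GLOBAL_DEFAULTS = {"warmup_runs": 10, "iterations": 50,
                   "rtol": 1e-2, "atol": 1e-2, "extra": {}}

def resolve_eval_config(op_type_cfg, definition_cfg):
    resolved = dict(GLOBAL_DEFAULTS)
    extra = dict(GLOBAL_DEFAULTS["extra"])
    for layer in (op_type_cfg, definition_cfg):  # lowest to highest priority
        for key, value in layer.items():
            if key == "extra":
                extra.update(value)   # extra dicts are merged across layers
            else:
                resolved[key] = value # scalar fields are simply overwritten
    resolved["extra"] = extra
    return resolved

cfg = resolve_eval_config(
    {"extra": {"sampling_tvd_threshold": 0.2}},                     # op-type layer
    {"rtol": 0.001, "extra": {"sampling_validation_trials": 200}},  # definition layer
)
# cfg["rtol"] comes from the definition layer; cfg["iterations"] from the
# global defaults; cfg["extra"] contains keys from both override layers.
```

Note that because extra is merged rather than replaced, an op-type layer can set one evaluator parameter and a definition layer another, and both survive in the resolved configuration.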

Runners

| Runner | Flag | Description |
|---|---|---|
| Persistent | (default) | Keeps a long-lived worker process per GPU. Lower overhead for many workloads. |
| Isolated | --use-isolated-runner | Spawns a new subprocess per workload. Better fault isolation. |

Evaluators

Different op types use specialized evaluators:
| Evaluator | Op Types | Notes |
|---|---|---|
| Default | gemm, rmsnorm, rope, gqa, mla, gdn | Element-wise tolerance check (rtol/atol) |
| Sampling | sampling | Statistical validation via TVD over multiple trials |
| Lowbit | lowbit | Element-wise with required_matched_ratio |
| DSA | dsa-paged | Specialized sparse attention validation |
Evaluators receive an evaluation configuration whose fields have been fully resolved through the merge chain above. The SamplingEvaluator performs statistical validation via Total Variation Distance (TVD) over multiple trials, reading its parameters directly from the extra dictionary in the configuration:

| Key | Default | Description |
|---|---|---|
| sampling_validation_trials | 100 | Number of sampling rounds for TVD validation |
| sampling_tvd_threshold | 0.2 | Maximum total variation distance to pass |
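For example, a YAML configuration overriding these parameters for the sampling op type could look like the following (the values here are illustrative, not recommendations):

```yaml
# Pass SamplingEvaluator parameters through the open-ended `extra` dict
op_type_config:
  sampling:
    extra:
      sampling_validation_trials: 200
      sampling_tvd_threshold: 0.15
```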

Custom evaluators

To add a custom evaluator:
  1. Subclass Evaluator (in flashinfer_bench/bench/evaluators/evaluator.py) and implement:
    • can_evaluate(definition) — return True for definitions this evaluator handles
    • build_baseline(definition, workload, cfg, device) — build reference outputs
    • check_correctness(definition, sol_runnable, inputs, ref_outputs, cfg, ...) — validate solution correctness
    • eval_performance(definition, sol_runnable, inputs, ref_mean_latency_ms, cfg, ...) — measure performance
  2. Register it in flashinfer_bench/bench/evaluators/registry.py by appending to the _EVALUATORS list. The first evaluator whose can_evaluate returns True is used; if none match, DefaultEvaluator is used.
  3. Use the extra dict in YAML config to pass evaluator-specific parameters (see Eval Config Fields above). Read them in your evaluator via cfg.extra.get("my_param", default_value).
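The steps above can be sketched as a skeleton. Note that the Evaluator base class below is a stub so the sketch is self-contained; in real code you would subclass the actual class from flashinfer_bench/bench/evaluators/evaluator.py, whose method signatures may differ in detail:

```python
# Illustrative skeleton only: `Evaluator` is stubbed here for self-containment.
# Subclass the real flashinfer_bench Evaluator and register the subclass in
# flashinfer_bench/bench/evaluators/registry.py (_EVALUATORS list).
from abc import ABC, abstractmethod

class Evaluator(ABC):  # stand-in for the real base class
    @abstractmethod
    def can_evaluate(self, definition): ...

class MyEvaluator(Evaluator):
    def can_evaluate(self, definition):
        # Claim only the definitions this evaluator handles;
        # "my_op" is a hypothetical op type used for illustration.
        return getattr(definition, "op_type", None) == "my_op"

    def build_baseline(self, definition, workload, cfg, device):
        # Run the reference implementation to produce ground-truth outputs.
        raise NotImplementedError

    def check_correctness(self, definition, sol_runnable, inputs, ref_outputs, cfg):
        # Compare solution outputs against ref_outputs; read evaluator-specific
        # knobs from the config, e.g. cfg.extra.get("my_param", 1e-3).
        raise NotImplementedError

    def eval_performance(self, definition, sol_runnable, inputs, ref_mean_latency_ms, cfg):
        # Time the solution and relate it to the reference latency.
        raise NotImplementedError
```

Because the registry picks the first evaluator whose can_evaluate returns True, keep the predicate narrow so your evaluator does not shadow the built-in ones for other op types.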