flashinfer-bench serve exposes an HTTP service for evaluating submitted Solution objects against workloads in a local TraceSet.
It is a benchmark evaluation service, not a general model inference server.
Start The Server
Install the server dependencies first.
flashinfer-bench serve accepts the following flags:
| Flag | Type | Required | Default | Description |
|---|---|---|---|---|
--local | path | Yes | None | Path to the local TraceSet. |
--devices | string | No | All available CUDA devices | Comma-separated CUDA devices such as cuda:0,cuda:1. |
--host | string | No | 0.0.0.0 | Host address for the HTTP server. |
--port | integer | No | 8000 | Port for the HTTP server. |
--warmup-runs | integer | No | 10 | Number of warmup runs before measurement. |
--iterations | integer | No | 50 | Number of benchmark iterations per trial. |
--num-trials | integer | No | 3 | Number of benchmark trials per workload. |
--rtol | float | No | 1e-2 | Relative tolerance for correctness checks. |
--atol | float | No | 1e-2 | Absolute tolerance for correctness checks. |
--timeout | integer | No | 300 | Per-solution evaluation timeout in seconds. |
--log-level | enum | No | INFO | Server log level. One of DEBUG, INFO, WARNING, or ERROR. |
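As a rough illustration of how the flags in the table combine, the sketch below assembles a command line for flashinfer-bench serve. Only flags listed in the table are used; the helper name build_serve_command and the path /data/traces are hypothetical and chosen for this example only.

```python
import shlex


def build_serve_command(local, devices=None, host=None, port=None,
                        timeout=None, log_level=None):
    """Assemble an argv list for `flashinfer-bench serve`.

    Flags left as None are omitted so the server uses its documented defaults.
    """
    argv = ["flashinfer-bench", "serve", "--local", str(local)]
    if devices:
        # --devices expects one comma-separated string, e.g. cuda:0,cuda:1
        argv += ["--devices", ",".join(devices)]
    optional = {
        "--host": host,
        "--port": port,
        "--timeout": timeout,
        "--log-level": log_level,
    }
    for flag, value in optional.items():
        if value is not None:
            argv += [flag, str(value)]
    return argv


# Hypothetical TraceSet path; replace with your own.
command = build_serve_command("/data/traces", devices=["cuda:0", "cuda:1"], port=8080)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.Popen if you want to launch the server from a script rather than from a shell.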
Mental Model
The server evaluates one submitted Solution asynchronously:
- Solution: The implementation you submit to the server.
- Task: The asynchronous evaluation job created for that submission.
- Trace: One evaluation result for one workload under that task.
- task.status tracks task lifecycle: pending, running, completed, or failed.
- traces[*].evaluation.status tracks the actual evaluation result for each workload, such as PASSED, COMPILE_ERROR, RUNTIME_ERROR, or TIMEOUT.
task.status = completed only means the task finished running. It does not mean the solution passed correctness checks.
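The distinction between the two status fields can be sketched as a small helper. It assumes the task response has already been decoded into a Python dict and that each trace nests its result under evaluation.status, following the traces[*].evaluation.status path above. The function name summarize_task is hypothetical.

```python
def summarize_task(task):
    """Return (finished, passed, per_trace_statuses) for a task response dict."""
    # A task is finished once its lifecycle status is terminal.
    finished = task["status"] in ("completed", "failed")
    # traces can be null while the task is pending or running.
    traces = task.get("traces") or []
    statuses = [trace["evaluation"]["status"] for trace in traces]
    # "Passed" requires a completed task AND every workload reporting PASSED.
    passed = (
        task["status"] == "completed"
        and bool(statuses)
        and all(status == "PASSED" for status in statuses)
    )
    return finished, passed, statuses


example = {
    "status": "completed",
    "traces": [{"evaluation": {"status": "COMPILE_ERROR"}}],
    "error": None,
}
print(summarize_task(example))  # (True, False, ['COMPILE_ERROR'])
```

The example task is completed but did not pass, which is exactly the case the note above warns about.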
API Reference
GET /definitions
Purpose
List available definitions in the loaded TraceSet.
Request
No request body.
Response
Returns an array of definition summaries.
| Field | Type | Description |
|---|---|---|
name | string | Definition name. |
description | string or null | Optional definition description. |
GET /definitions/{name}
Purpose
Return the full serialized Definition object for one definition.
Request
Path parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Definition name. |
Response
Returns the full serialized Definition object.
Use this endpoint when you need the exact contract before writing a passing solution.
Errors
404: Definition not found.
GET /definitions/{name}/workloads
Purpose
List workloads for one definition.
Request
Path parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Definition name. |
Response
Returns an array of Workload objects.
Use this endpoint to discover valid workload UUIDs for POST /evaluate.
Errors
404: Definition not found.
GET /workloads/{uuid}
Purpose
Return one workload by UUID.
Request
Path parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
uuid | string | Yes | Workload UUID. |
Response
Returns the Workload object.
Errors
404: Workload not found.
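The four read-only endpoints above can be called with the Python standard library alone. The following is a minimal sketch, assuming the server runs locally on the default port 8000 and returns JSON. The helper names endpoint, get_json, and list_workloads are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: server reachable locally on the default port.
BASE_URL = "http://localhost:8000"


def endpoint(*segments, base_url=BASE_URL):
    """Join path segments onto the base URL, percent-encoding each one."""
    path = "/".join(quote(str(segment), safe="") for segment in segments)
    return base_url.rstrip("/") + "/" + path


def get_json(url):
    """GET a URL and decode the JSON body.

    A 404 (unknown definition or workload) surfaces as urllib.error.HTTPError.
    """
    with urlopen(url) as response:
        return json.load(response)


def list_definitions(base_url=BASE_URL):
    """GET /definitions: array of {name, description} summaries."""
    return get_json(endpoint("definitions", base_url=base_url))


def list_workloads(definition_name, base_url=BASE_URL):
    """GET /definitions/{name}/workloads: array of Workload objects."""
    return get_json(endpoint("definitions", definition_name, "workloads", base_url=base_url))
```

Percent-encoding each path segment keeps definition names that contain spaces or slashes from changing the route.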
POST /evaluate
Purpose
Submit one solution for evaluation.
Request
Request body fields:
| Field | Type | Required | Description |
|---|---|---|---|
solution | object | Yes | Full Solution object to evaluate. |
workload_uuids | string[] | No | Optional subset of workload UUIDs. If omitted, the server evaluates all workloads for the definition. |
The submitted Solution must still match the selected definition’s real inputs and outputs.
Response
Response fields:
| Field | Type | Description |
|---|---|---|
task_id | string | Identifier for the asynchronous evaluation task. |
normalized_solution_name | string | Server-normalized solution name after Solution.with_unique_name(). |
- The server normalizes the submitted solution name by calling Solution.with_unique_name().
- normalized_solution_name is deterministic for the same solution content.
- If the selected workloads are empty, the task is still created, but it later ends with task.status = failed.
Errors
400: solution.definition does not exist.
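A submission might be sketched as follows. The Solution object is treated as an opaque, already-serialized dict because its schema is not described here. The JSON content type, the local base URL, and the helper names build_evaluate_body and submit are assumptions made for this example.

```python
import json
from urllib.request import Request, urlopen


def build_evaluate_body(solution, workload_uuids=None):
    """Build the POST /evaluate request body.

    workload_uuids is omitted entirely when None, so the server evaluates
    all workloads for the definition.
    """
    body = {"solution": solution}
    if workload_uuids is not None:
        body["workload_uuids"] = list(workload_uuids)
    return body


def submit(solution, workload_uuids=None, base_url="http://localhost:8000"):
    """Submit one solution and return (task_id, normalized_solution_name)."""
    payload = json.dumps(build_evaluate_body(solution, workload_uuids)).encode("utf-8")
    request = Request(
        base_url + "/evaluate",
        data=payload,
        headers={"Content-Type": "application/json"},  # assumption: JSON body
        method="POST",
    )
    # A 400 (unknown solution.definition) raises urllib.error.HTTPError.
    with urlopen(request) as response:
        reply = json.load(response)
    return reply["task_id"], reply["normalized_solution_name"]
```

Keep the returned task_id: it is the only handle for retrieving results later.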
GET /tasks/{task_id}
Purpose
Get one task by ID.
Request
Path parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
task_id | string | Yes | Task identifier. |
Query parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
timeout | float | No | Optional value in the range 0..3600. 0 means return immediately. A positive value enables long-polling until the task completes or the timeout expires. |
Response
Response fields:
| Field | Type | Description |
|---|---|---|
task_id | string | Task identifier. |
status | string | Task lifecycle status: pending, running, completed, or failed. |
definition | string | Definition name associated with the submitted solution. |
solution | string | Normalized solution name used by the server. |
traces | object[] or null | Serialized trace results. Can be null while the task is still pending or running. |
error | string or null | Task-level failure message. Usually null unless status = failed. |
- If the task is still pending or running, traces may be null.
- If the task fails at the task level, error contains the failure reason.
Errors
404: Task not found.
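A single long-poll request might look like the sketch below. It validates the documented 0..3600 range on the client before sending. The local base URL, the extra 10 seconds of socket slack, and the helper names task_url and fetch_task are assumptions for illustration.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def task_url(task_id, timeout=0.0, base_url="http://localhost:8000"):
    """Build the GET /tasks/{task_id} URL with its timeout query parameter."""
    if not 0 <= timeout <= 3600:
        raise ValueError("timeout must be in the range 0..3600")
    query = urlencode({"timeout": timeout})
    return f"{base_url}/tasks/{quote(task_id, safe='')}?{query}"


def fetch_task(task_id, timeout=0.0, base_url="http://localhost:8000"):
    """Fetch one task; a positive timeout lets the server hold the request open."""
    url = task_url(task_id, timeout, base_url)
    # Give the socket more time than the server-side wait so the client
    # does not give up before the long-poll returns (slack is an assumption).
    with urlopen(url, timeout=timeout + 10) as response:
        return json.load(response)
```

For example, fetch_task(task_id, timeout=30) returns as soon as the task completes or after the server-side wait expires, whichever comes first.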
POST /tasks/batch
Purpose
Query multiple tasks in one request.
Request
Request body fields:
| Field | Type | Required | Description |
|---|---|---|---|
task_ids | string[] | Yes | Task IDs to query. Response order matches this array. |
timeout | float | No | Optional wait time in seconds. timeout <= 0 returns immediately. timeout > 0 waits until all tasks complete or the timeout expires. |
Response
Returns an array of TaskResponse objects.
Each item has the following fields:
| Field | Type | Description |
|---|---|---|
task_id | string | Task identifier. |
status | string | Task lifecycle status: pending, running, completed, or failed. |
definition | string | Definition name associated with the submitted solution. |
solution | string | Normalized solution name used by the server. |
traces | object[] or null | Serialized trace results. Can be null while the task is still pending or running. |
error | string or null | Task-level failure message. Usually null unless status = failed. |
- Returns a list of TaskResponse objects in the same order as task_ids.
- Duplicate task IDs are allowed and produce duplicate results.
Errors
404: At least one task ID does not exist. The request is fail-fast.
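Because the response order matches task_ids, results can be paired with their IDs positionally. The sketch below builds the request body and lists the IDs that are still in flight. It works on already-decoded dicts, and the helper names are hypothetical.

```python
def build_batch_body(task_ids, timeout=0.0):
    """Build the POST /tasks/batch request body."""
    return {"task_ids": list(task_ids), "timeout": timeout}


def unfinished_ids(task_ids, responses):
    """Return the IDs whose tasks are still pending or running.

    Relies on the response list matching the order of task_ids.
    Duplicate IDs in the request are reported once.
    """
    if len(task_ids) != len(responses):
        raise ValueError("expected one response per requested task ID")
    pending = []
    for task_id, response in zip(task_ids, responses):
        if response["status"] in ("pending", "running") and task_id not in pending:
            pending.append(task_id)
    return pending
```

Since the endpoint is fail-fast on unknown IDs, one stale ID fails the whole request, so it may be worth dropping IDs you no longer track before batching.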
GET /health
Purpose
Return worker health and queue depth.
Request
No request body.
Response
Response fields:
| Field | Type | Description |
|---|---|---|
status | string | Overall server health status. |
workers | object[] | Per-worker health information. |
queue_size | integer | Number of queued tasks waiting to run. |
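One possible use of the health response is to hold back new submissions while the queue is deep. The sketch below condenses a decoded response into a few values. The threshold max_queue_size and the helper name describe_health are hypothetical and not part of the server.

```python
def describe_health(health, max_queue_size=16):
    """Condense a GET /health response dict into a few operational values.

    max_queue_size is a client-side threshold chosen for illustration only.
    """
    queue_size = health["queue_size"]
    return {
        "status": health["status"],
        "worker_count": len(health["workers"]),
        "queue_size": queue_size,
        # True when the client should probably wait before submitting more.
        "backlogged": queue_size >= max_queue_size,
    }
```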
POST /shutdown (Management)
Purpose
Ask the current server process to exit gracefully.
Request
No request body.
Response
Response fields:
| Field | Type | Description |
|---|---|---|
status | string | Shutdown acknowledgement, currently shutting_down. |
Polling And Error Semantics
Keep these semantics in mind when integrating with the server:
- task.status = completed means the task finished, not that the solution passed.
- Look at traces[*].evaluation.status for correctness and performance outcomes.
- task.status = failed indicates task-level failures such as missing workloads or other failures that prevent evaluation from completing normally.
- In GET /tasks/{task_id}, timeout must be in the range 0..3600.
- In POST /tasks/batch, timeout <= 0 returns immediately and timeout > 0 waits up to the provided value.
- POST /tasks/batch is fail-fast on invalid task IDs.
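These rules suggest a polling loop that repeats long-poll requests until the task reaches a terminal status. The sketch below takes the HTTP call as a caller-supplied function, so the loop itself has no network dependency. The names wait_for_task and fetch, and the default time limits, are assumptions for this example.

```python
import time

TERMINAL_STATUSES = ("completed", "failed")


def wait_for_task(fetch, task_id, poll_timeout=30.0, deadline_seconds=600.0,
                  clock=time.monotonic):
    """Long-poll until the task is terminal or the overall deadline passes.

    fetch(task_id, timeout) must perform GET /tasks/{task_id} with the given
    timeout query parameter and return the decoded response dict.
    """
    # A positive per-request timeout avoids a busy loop; 3600 is the server limit.
    if not 0 < poll_timeout <= 3600:
        raise ValueError("poll_timeout must be positive and at most 3600")
    deadline = clock() + deadline_seconds
    while True:
        task = fetch(task_id, poll_timeout)
        if task["status"] in TERMINAL_STATUSES:
            # Terminal does not mean passed: inspect the traces afterwards.
            return task
        if clock() >= deadline:
            raise TimeoutError(
                f"task {task_id} still {task['status']} after {deadline_seconds}s"
            )
```

After the loop returns, check traces[*].evaluation.status for the per-workload outcome, and check error when the status is failed.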
Minimal Runnable Example
This example shows the smallest end-to-end flow that works without depending on a specific kernel signature. It intentionally submits a Python solution with a syntax error, so the task should complete with COMPILE_ERROR. That makes the example portable across trace datasets as long as you choose a definition that has at least one workload.
Requirements:
- curl
- jq
- A running benchmark server
Expected result:
- Top-level status should become completed.
- traces[0].evaluation.status should be COMPILE_ERROR.
To get PASSED instead, inspect GET /definitions/{name} and implement a real solution that matches that definition’s inputs and outputs.
Notes
- The server requires at least one CUDA device.
- Reference results are cached per (definition, workload) inside each worker process.
- GET /health is intended for operational checks rather than task inspection.
- Submitted solution names are normalized before evaluation.

