Leaderboard
Examine overall author performance across every kernel definition and workload.
1. gemini-2.5-pro | 0.628x | 73.1% (660 workloads)
2. gpt-5-2025-08-07 | 0.467x | 92.3% (660 workloads)
3. claude-opus-4-1-20250805 | 0.456x | 73.1% (660 workloads)
4. gpt-o3 | 0.450x | 92.3% (660 workloads)
Models
Explore model architectures and their kernel implementations.
DeepSeek V3/R1
DeepSeek V3 and R1 models.
Llama 3.1 8B
Meta's Llama 3.1 8B-parameter model.
Qwen3 30B A3B
Qwen3 MoE 30B A3B model (3B active parameters).
Qwen3 Next 80B A3B
Qwen3 Next 80B with 3B active parameters. Hybrid architecture combining Gated DeltaNet (linear attention) and Gated Attention (standard GQA) with high-sparsity MoE.
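To make the hybrid layout concrete, here is a minimal Python sketch of such a schedule; the layer count and the every-4th-layer placement of the full-attention blocks are placeholder assumptions, not figures from this entry.

```python
# Illustrative only: builds a hybrid layer schedule mixing Gated DeltaNet
# ("linear attention") blocks with Gated Attention ("standard GQA") blocks.
# The 48-layer count and the every-4th-layer placement are assumptions.
def hybrid_schedule(num_layers: int = 48, full_attn_every: int = 4) -> list[str]:
    return [
        "gated_attention" if (i + 1) % full_attn_every == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]

schedule = hybrid_schedule()
print(schedule.count("gated_deltanet"), schedule.count("gated_attention"))  # 36 12
```
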
NemotronH-8B
NVIDIA NemotronH-8B hybrid architecture combining Mamba2 SSM layers with standard attention. 52 total layers: 24 Mamba (M), 4 Attention (*), 24 MLP-only (-). Mamba layers use FlashInfer selective_state_update for decode.
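A small sketch of the layer-type bookkeeping implied by those counts; the pattern string is a hypothetical, evenly spaced layout chosen only to reproduce 24/4/24 over 52 layers, not the model's actual order.

```python
# Hypothetical layer pattern matching the stated counts:
# M = Mamba2 block, * = attention block, - = MLP-only block.
from collections import Counter

pattern = ("M-" * 6 + "*") * 4      # 4 repeats of (6x Mamba+MLP, then attention) = 52 layers
counts = Counter(pattern)
assert len(pattern) == 52
assert counts == {"M": 24, "-": 24, "*": 4}
# At decode time each "M" layer advances its SSM state one token at a time
# with a selective-state-update kernel (FlashInfer's, per the entry above).
```
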
MiniMax M2
MiniMax M2 model. 62 decoder layers, GQA attention (48 q-heads / 8 kv-heads), sparse MoE (256 experts, top-8 sigmoid routing).
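A minimal PyTorch sketch of top-8 sigmoid routing over 256 experts as described here; renormalizing the selected scores to sum to 1, along with all names and shapes, is an assumption for illustration.

```python
# Sketch of top-k sigmoid routing: score every expert with a sigmoid,
# keep the 8 best per token, renormalize their weights (assumption).
import torch

def sigmoid_topk_route(router_logits: torch.Tensor, top_k: int = 8):
    scores = torch.sigmoid(router_logits)               # [num_tokens, 256]
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)  # pick 8 experts per token
    weights = topk_scores / topk_scores.sum(-1, keepdim=True)
    return topk_idx, weights                            # both [num_tokens, 8]

idx, w = sigmoid_topk_route(torch.randn(4, 256))
```
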
Kimi K2
Moonshot AI Kimi K2. 61 decoder layers, MLA attention (64 heads, TP=8 → h=8), sparse MoE (384 experts, top-8 DeepSeek routing, EP=8 → 48 local experts). Servable on 8×B200 (4×B200×2) at FP8.
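The per-GPU figures follow directly from the head and expert counts quoted above; a quick arithmetic check:

```python
# Back-of-the-envelope check of the TP/EP sharding quoted for Kimi K2.
NUM_HEADS, NUM_EXPERTS = 64, 384
TP, EP = 8, 8

heads_per_rank = NUM_HEADS // TP     # 64 / 8 = 8 MLA heads per GPU
local_experts = NUM_EXPERTS // EP    # 384 / 8 = 48 routed experts per GPU
assert heads_per_rank == 8 and local_experts == 48
```
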
Llama 4 Scout 17B-16E
Meta Llama 4 Scout 17B-16E. 48 decoder layers with interleaved local (chunked RoPE, ×40) and global (NoPE, ×8) attention, MoE FFN (16 experts, top-1). TP=8 on 8×B200 (4×B200×2): 40/8=5 q-heads, 8/8=1 kv-head per rank.
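Worked numbers for the head split and the attention interleave; the "global every 6th layer" placement is an assumption chosen only to yield 8 global layers out of 48, since the entry gives counts but not positions.

```python
# Per-rank head split under TP=8, plus an assumed local/global layer schedule.
Q_HEADS, KV_HEADS, TP = 40, 8, 8
assert Q_HEADS // TP == 5 and KV_HEADS // TP == 1     # per-rank q/kv heads

NUM_LAYERS = 48
layer_types = ["global_nope" if (i + 1) % 6 == 0 else "local_chunked_rope"
               for i in range(NUM_LAYERS)]
assert layer_types.count("global_nope") == 8          # 8 global, 40 local
```
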
