# Model Kernel Coverage

This document tracks which kernels are supported in FlashInfer-Bench for each model.

- ✅ Definition JSON exists and workload has been collected
- 🟡 Definition JSON exists but workload has not yet been collected
- ❌ Definition is referenced in `models.ts` but the file does not exist (missing)
- ⚠️ Module exists in the architecture but no definition is mapped (unmapped)
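
The four states above amount to a simple check against the definition files and the collected workloads. The sketch below is illustrative only; it assumes a hypothetical layout with one `definitions/<name>.json` file per definition and a `workloads.json` index of collected workloads, which is not FlashInfer-Bench's actual file layout.

```python
import json
from pathlib import Path

# Hypothetical layout, for illustration only:
#   definitions/<name>.json  -- one JSON file per definition
#   workloads.json           -- list of definition names with collected workloads
DEFS_DIR = Path("definitions")
WORKLOADS = Path("workloads.json")
COLLECTED = set(json.loads(WORKLOADS.read_text())) if WORKLOADS.exists() else set()

def coverage_status(name: str, referenced_in_models_ts: bool) -> str:
    """Classify a definition according to the legend above."""
    if (DEFS_DIR / f"{name}.json").exists():
        return "✅" if name in COLLECTED else "🟡"
    if referenced_in_models_ts:
        return "❌"  # referenced in models.ts but the JSON file is missing
    return "⚠️"      # module exists in the architecture, no definition mapped

print(coverage_status("rmsnorm_h4096", referenced_in_models_ts=True))
```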

## Summary

| Model | Architecture | Coverage |
|---|---|---|
| DeepSeek V3/R1 | MLA + Dense/MoE | 🟡 Partial |
| DeepSeek V3.2 | DSA + Dense/MoE | ✅ Fully covered |
| Llama 3.1 8B | GQA + Dense | ✅ Fully covered |
| Llama 3.1/3.3 70B | GQA + Dense | 🟡 Partial |
| Llama 3.2 3B | GQA + Dense | 🟡 Partial |
| Mistral 7B v0.3 | GQA + Dense | 🟡 Partial |
| Mistral Nemo 12B | GQA + Dense | 🟡 Partial |
| Mixtral 8x7B | GQA + MoE | 🟡 Partial |
| Mixtral 8x22B | GQA + MoE | ❌ Not covered |
| Qwen2.5 7B | GQA + Dense | 🟡 Partial |
| Qwen2.5 72B | GQA + Dense | 🟡 Partial |
| Qwen3 8B | GQA + Dense | 🟡 Partial |
| Qwen3 30B A3B | GQA + MoE | 🟡 Partial |
| Qwen3 32B | GQA + Dense | 🟡 Partial |
| Qwen3 235B A22B | GQA + MoE | 🟡 Partial |
| Qwen3 Next 80B A3B | GDN + GQA + MoE | 🟡 Partial |
| Kimi K2 | MLA + MoE | 🟡 Partial |
| Phi-4 14B | GQA + Dense | 🟡 Partial |
| Llama 3.1 405B | GQA + Dense | 🟡 Partial |
| Llama 4 Scout 17B-16E | GQA + MoE | 🟡 Partial |
| Llama 4 Maverick 17B-128E | GQA + MoE | 🟡 Partial |
| Mistral Small 3.1 24B | GQA + Dense | 🟡 Partial |
| GLM-4.6 | GQA + Dense | ❌ Not covered |
| MiniMax M2 / Text-01 | Lightning Attn + MoE | ❌ Not covered |
| Gemma 3 27B | GQA + Dense | 🟡 Partial |
| Qwen3 14B | GQA + Dense | 🟡 Partial |
| NemotronH 47B | GQA + Mamba2 Hybrid | ❌ Not covered |

## DeepSeek V3 / R1

Architecture: 61 decoder layers, MLA attention, hybrid Dense+MoE FFN.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h7168` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h7168` | rmsnorm | ✅ |
| `rmsnorm_h1536` | rmsnorm | ✅ |
| `rmsnorm_h512` | rmsnorm | ✅ |
| `gemm_n256_k7168` | gemm | ✅ |
| `mla_ragged_prefill_causal_h16_qk192_vo128` | mla_ragged | ✅ |
| `mla_paged_prefill_causal_h16_ckv512_kpe64_ps1` | mla_paged | ✅ |
| `mla_paged_prefill_causal_h16_ckv512_kpe64_ps64` | mla_paged | ✅ |
| `mla_paged_decode_h16_ckv512_kpe64_ps1` | mla_paged | ✅ |
| `mla_paged_decode_h16_ckv512_kpe64_ps64` | mla_paged | ✅ |
| `moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048` | moe | ✅ |
| `top_k_sampling_from_probs_v129280` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v129280` | sampling | ✅ |
| `top_p_sampling_from_probs_v129280` | sampling | ✅ |

## DeepSeek V3.2

Architecture: 61 decoder layers, DSA (DeepSeek Sparse Attention) replacing dense MLA, hybrid Dense+MoE FFN. Standard serving configuration: TP=8. DSA introduces a learned TopK indexer that selects a sparse subset of KV pages before running attention, reducing computation for long contexts while preserving accuracy.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h7168` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h7168` | rmsnorm | ✅ |
| `rmsnorm_h1536` | rmsnorm | ✅ |
| `rmsnorm_h512` | rmsnorm | ✅ |
| `dsa_topk_indexer_fp8_h64_d128_topk2048_ps64` | dsa_paged | ✅ |
| `dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps1` | dsa_paged | ✅ |
| `dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps64` | dsa_paged | ✅ |
| `moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048` | moe | ✅ |
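
As a rough illustration of the two-stage flow described above (TopK indexer first, then attention over only the selected pages), here is a minimal decode-step sketch. The function, shapes, and single-tensor KV representation are illustrative only and do not correspond to the actual definition interfaces.

```python
import torch

def dsa_decode_sketch(q, kv_pages, indexer_scores, topk=2048, page_size=64):
    """Illustrative DeepSeek Sparse Attention decode step for one token.

    q:              [num_heads, head_dim]
    kv_pages:       [num_pages, page_size, head_dim] paged KV cache
                    (keys and values shown as one tensor for brevity)
    indexer_scores: [num_pages] learned relevance score per page
    """
    # Stage 1: the TopK indexer selects a sparse subset of KV pages.
    num_selected = min(topk // page_size, indexer_scores.numel())
    _, page_ids = torch.topk(indexer_scores, num_selected)

    # Stage 2: attention runs only over the selected pages.
    kv = kv_pages[page_ids].reshape(-1, kv_pages.shape[-1])  # [num_selected * page_size, head_dim]
    scores = (q @ kv.T) / (q.shape[-1] ** 0.5)               # [num_heads, num_selected * page_size]
    return torch.softmax(scores, dim=-1) @ kv                # [num_heads, head_dim]
```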

## Llama 3.1 8B

Architecture: 32 decoder layers, GQA attention, dense MLP.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gemm_n6144_k4096` | gemm | ✅ |
| `gemm_n4096_k4096` | gemm | ✅ |
| `gemm_n28672_k4096` | gemm | ✅ |
| `gemm_n4096_k14336` | gemm | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `top_k_sampling_from_probs_v128256` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v128256` | sampling | ✅ |
| `top_p_sampling_from_probs_v128256` | sampling | ✅ |

## Qwen3 30B A3B

Architecture: 32 decoder layers, GQA attention, MoE FFN (30 MoE + 2 dense layers).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h128` | rmsnorm | ❌ |
| `rmsnorm_h2048` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h2048` | rmsnorm | ❌ |
| `gemm_n128_k2048` | gemm | ❌ |
| `gemm_n2048_k4096` | gemm | ❌ |
| `gemm_n5120_k2048` | gemm | ❌ |
| `gqa_paged_prefill_causal_h32_kv4_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_prefill_causal_h32_kv4_d128_ps64` | gqa_paged | ❌ |
| `gqa_paged_decode_h32_kv4_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_decode_h32_kv4_d128_ps64` | gqa_paged | ❌ |
| `gqa_ragged_prefill_causal_h32_kv4_d128` | gqa_ragged | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |
| MoE gate / topk / experts | moe | ⚠️ |

## Qwen3 Next 80B A3B

Architecture: 48 layers total: 36 GDN (linear attention) + 12 GQA (standard attention); all layers use MoE FFN. Standard serving configuration: TP=2 or TP=4.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h2048` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h2048` | rmsnorm | ❌ |
| `gdn_prefill_qk16_v32_d128_k_last` | gdn TP=1 | 🟡 |
| `gdn_prefill_qk8_v16_d128_k_last` | gdn TP=2 | ❌ |
| `gdn_prefill_qk4_v8_d128_k_last` | gdn TP=4 | ❌ |
| `gdn_decode_qk16_v32_d128_k_last` | gdn TP=1 | 🟡 |
| `gdn_decode_qk8_v16_d128_k_last` | gdn TP=2 | ❌ |
| `gdn_decode_qk4_v8_d128_k_last` | gdn TP=4 | ❌ |
| `gdn_mtp_qk16_v32_d128_k_last` | gdn TP=1 | 🟡 |
| `gdn_mtp_qk8_v16_d128_k_last` | gdn TP=2 | ❌ |
| `gdn_mtp_qk4_v8_d128_k_last` | gdn TP=4 | ❌ |
| `gqa_paged_prefill_causal_h8_kv1_d256_ps1` | gqa_paged TP=2 | ❌ |
| `gqa_paged_decode_h8_kv1_d256_ps1` | gqa_paged TP=2 | ❌ |
| `gqa_ragged_prefill_causal_h8_kv1_d256` | gqa_ragged TP=2 | ❌ |
| MoE gate / topk / experts (GDN layers) | moe | ⚠️ |
| MoE gate / topk / experts (GQA layers) | moe | ⚠️ |
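
The three TP variants above differ only in per-device head counts: the full model's 16 qk heads and 32 v heads (the TP=1 names) are split evenly across tensor-parallel ranks. A small sketch of how the definition names scale with TP, using the naming pattern from the table; the helper itself is illustrative, not part of FlashInfer-Bench:

```python
def gdn_def_name(stage: str, tp: int, qk_heads: int = 16, v_heads: int = 32, head_dim: int = 128) -> str:
    """GDN definition name for a given TP degree (naming pattern taken from the table above)."""
    assert qk_heads % tp == 0 and v_heads % tp == 0
    return f"gdn_{stage}_qk{qk_heads // tp}_v{v_heads // tp}_d{head_dim}_k_last"

for tp in (1, 2, 4):
    print(gdn_def_name("decode", tp))
# gdn_decode_qk16_v32_d128_k_last
# gdn_decode_qk8_v16_d128_k_last
# gdn_decode_qk4_v8_d128_k_last
```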

## Llama 3.1 / 3.3 70B

Architecture: 80 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=4 (from sgl-cookbook). Llama 3.1 70B and 3.3 70B share identical architecture dimensions; only training data and context window differ.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h8192` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h8192` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h16_kv2_d128_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_prefill_causal_h16_kv2_d128_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv2_d128_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv2_d128_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_ragged_prefill_causal_h16_kv2_d128` | gqa_ragged TP=4 | ❌ |
| `gemm_n10240_k8192` | gemm | ❌ |
| `gemm_n8192_k8192` | gemm | ❌ |
| `gemm_n57344_k8192` | gemm | ❌ |
| `gemm_n8192_k28672` | gemm | ❌ |
| `top_k_sampling_from_probs_v128256` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v128256` | sampling | ✅ |
| `top_p_sampling_from_probs_v128256` | sampling | ✅ |
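
The TP=4 annotations encode per-device head counts: Llama 3.1/3.3 70B has 64 query heads and 8 KV heads in total, so each of the four tensor-parallel ranks holds 16 query heads and 2 KV heads, which is where the `h16_kv2` in the definition names comes from. A minimal sketch of that mapping (the helper is illustrative, not part of FlashInfer-Bench):

```python
def gqa_decode_def_name(q_heads: int, kv_heads: int, head_dim: int, tp: int, page_size: int) -> str:
    """Per-device GQA paged-decode definition name under tensor parallelism."""
    assert q_heads % tp == 0 and kv_heads % tp == 0
    return f"gqa_paged_decode_h{q_heads // tp}_kv{kv_heads // tp}_d{head_dim}_ps{page_size}"

# Llama 3.1/3.3 70B: 64 query heads, 8 KV heads, head_dim 128, served at TP=4.
print(gqa_decode_def_name(64, 8, 128, tp=4, page_size=1))    # gqa_paged_decode_h16_kv2_d128_ps1
print(gqa_decode_def_name(64, 8, 128, tp=4, page_size=64))   # gqa_paged_decode_h16_kv2_d128_ps64
```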

## Llama 3.2 3B

Architecture: 28 decoder layers, GQA attention, dense MLP.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h3072` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h3072` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h24_kv8_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_prefill_causal_h24_kv8_d128_ps64` | gqa_paged | ❌ |
| `gqa_paged_decode_h24_kv8_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_decode_h24_kv8_d128_ps64` | gqa_paged | ❌ |
| `gqa_ragged_prefill_causal_h24_kv8_d128` | gqa_ragged | ❌ |
| `gemm_n5120_k3072` | gemm | ❌ |
| `gemm_n3072_k3072` | gemm | ❌ |
| `gemm_n16384_k3072` | gemm | ❌ |
| `gemm_n3072_k8192` | gemm | ❌ |
| `top_k_sampling_from_probs_v128256` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v128256` | sampling | ✅ |
| `top_p_sampling_from_probs_v128256` | sampling | ✅ |

## Mistral 7B v0.3

Architecture: 32 decoder layers, GQA attention, dense MLP. Shares identical hidden, attention, and MLP dimensions with Llama 3.1 8B (hidden=4096, 32q/8kv heads, intermediate=14336).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `gemm_n6144_k4096` | gemm | ✅ |
| `gemm_n4096_k4096` | gemm | ✅ |
| `gemm_n28672_k4096` | gemm | ✅ |
| `gemm_n4096_k14336` | gemm | ✅ |
| `top_k_sampling_from_probs_v32000` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v32000` | sampling | ❌ |
| `top_p_sampling_from_probs_v32000` | sampling | ❌ |

## Mistral Nemo 12B

Architecture: 40 decoder layers, GQA attention (explicit head_dim=128), dense MLP. Standard serving configuration: TP=1 (from sgl-cookbook).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `gemm_n6144_k5120` | gemm | ❌ |
| `gemm_n5120_k4096` | gemm | ❌ |
| `gemm_n28672_k5120` | gemm | ❌ |
| `gemm_n5120_k14336` | gemm | ❌ |
| `top_k_sampling_from_probs_v131072` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v131072` | sampling | ❌ |
| `top_p_sampling_from_probs_v131072` | sampling | ❌ |

## Mixtral 8x7B

Architecture: 32 decoder layers, GQA attention, sparse MoE FFN (8 experts, top-2 routing). Shares attention and normalization dimensions with Llama 3.1 8B / Mistral 7B.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `gemm_n6144_k4096` | gemm | ✅ |
| `gemm_n4096_k4096` | gemm | ✅ |
| MoE experts (top-2, 8 experts, inter=14336) | moe | ⚠️ |
| `top_k_sampling_from_probs_v32000` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v32000` | sampling | ❌ |
| `top_p_sampling_from_probs_v32000` | sampling | ❌ |

## Mixtral 8x22B

Architecture: 56 decoder layers, GQA attention, sparse MoE FFN (8 experts, top-2 routing). All dimensions are new (hidden=6144, 48q/8kv heads).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h6144` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h6144` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h48_kv8_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_prefill_causal_h48_kv8_d128_ps64` | gqa_paged | ❌ |
| `gqa_paged_decode_h48_kv8_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_decode_h48_kv8_d128_ps64` | gqa_paged | ❌ |
| `gqa_ragged_prefill_causal_h48_kv8_d128` | gqa_ragged | ❌ |
| `gemm_n8192_k6144` | gemm | ❌ |
| `gemm_n6144_k6144` | gemm | ❌ |
| MoE experts (top-2, 8 experts, inter=16384) | moe | ⚠️ |
| `top_k_sampling_from_probs_v32768` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v32768` | sampling | ❌ |
| `top_p_sampling_from_probs_v32768` | sampling | ❌ |

## Qwen2.5 7B

Architecture: 28 decoder layers, GQA attention, dense MLP.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h3584` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h3584` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h28_kv4_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_prefill_causal_h28_kv4_d128_ps64` | gqa_paged | ❌ |
| `gqa_paged_decode_h28_kv4_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_decode_h28_kv4_d128_ps64` | gqa_paged | ❌ |
| `gqa_ragged_prefill_causal_h28_kv4_d128` | gqa_ragged | ❌ |
| `gemm_n4608_k3584` | gemm | ❌ |
| `gemm_n3584_k3584` | gemm | ❌ |
| `gemm_n37888_k3584` | gemm | ❌ |
| `gemm_n3584_k18944` | gemm | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |

## Qwen2.5 72B

Architecture: 80 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=8 (from sgl-cookbook).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h8192` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h8192` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h8_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_prefill_causal_h8_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h8_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h8_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_ragged_prefill_causal_h8_kv1_d128` | gqa_ragged TP=8 | ❌ |
| `gemm_n10240_k8192` | gemm | ❌ |
| `gemm_n8192_k8192` | gemm | ❌ |
| `gemm_n59392_k8192` | gemm | ❌ |
| `gemm_n8192_k29696` | gemm | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |

## Qwen3 8B

Architecture: 36 decoder layers, GQA attention, dense MLP. Shares hidden size and attention dimensions with Llama 3.1 8B (hidden=4096, 32q/8kv heads, head_dim=128), but uses a larger MLP intermediate size (22016 vs 14336).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `gemm_n6144_k4096` | gemm | ✅ |
| `gemm_n4096_k4096` | gemm | ✅ |
| `gemm_n44032_k4096` | gemm | ❌ |
| `gemm_n4096_k22016` | gemm | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |
Only the MLP GEMMs are new: the fused gate/up GEMM (`gemm_n44032_k4096`, intermediate=22016 × 2) and the down GEMM (`gemm_n4096_k22016`). All normalization, attention, and non-MLP GEMM kernels are shared with Llama 3.1 8B.
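
The GEMM shapes in these tables follow directly from the model config: the fused QKV projection has N = (q_heads + 2·kv_heads)·head_dim, the output projection has N = hidden and K = q_heads·head_dim, the fused gate/up projection has N = 2·intermediate, and the down projection has K = intermediate. A sketch using Qwen3 8B's dimensions; the helper name is illustrative:

```python
def dense_layer_gemm_names(hidden: int, q_heads: int, kv_heads: int, head_dim: int, intermediate: int) -> list[str]:
    """GEMM definition names for one dense decoder layer, using the naming from the tables above."""
    return [
        f"gemm_n{(q_heads + 2 * kv_heads) * head_dim}_k{hidden}",  # fused QKV projection
        f"gemm_n{hidden}_k{q_heads * head_dim}",                   # output projection
        f"gemm_n{2 * intermediate}_k{hidden}",                     # fused gate/up projection
        f"gemm_n{hidden}_k{intermediate}",                         # down projection
    ]

# Qwen3 8B: hidden 4096, 32 query / 8 KV heads, head_dim 128, intermediate 22016.
print(dense_layer_gemm_names(4096, 32, 8, 128, 22016))
# ['gemm_n6144_k4096', 'gemm_n4096_k4096', 'gemm_n44032_k4096', 'gemm_n4096_k22016']
```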

## Qwen3 32B

Architecture: 64 decoder layers, GQA attention, dense MLP. Uses a non-standard head_dim=64 (hidden=4096, 64 query heads). Standard serving configuration: TP=4.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gqa_paged_prefill_causal_h16_kv2_d64_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_prefill_causal_h16_kv2_d64_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv2_d64_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv2_d64_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_ragged_prefill_causal_h16_kv2_d64` | gqa_ragged TP=4 | ❌ |
| `gemm_n5120_k4096` | gemm | ❌ |
| `gemm_n4096_k4096` | gemm | ✅ |
| `gemm_n44032_k4096` | gemm | ❌ |
| `gemm_n4096_k22016` | gemm | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |
Only the output-projection GEMM (`gemm_n4096_k4096`) is shared, because 64 heads × 64 head_dim = 4096 = hidden_size.

## Qwen3 235B A22B

Architecture: 94 decoder layers, GQA attention, sparse MoE FFN (128 experts, top-8 routing). Uses head_dim=64 (hidden=4096, 64 query heads). Standard serving configuration: TP=8, EP=2 (FP8 variant from sgl-cookbook). With 4 KV heads, effective per-device TP for attention is TP=4 (kv=1 per device).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gqa_paged_prefill_causal_h16_kv1_d64_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_prefill_causal_h16_kv1_d64_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv1_d64_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv1_d64_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_ragged_prefill_causal_h16_kv1_d64` | gqa_ragged TP=4 | ❌ |
| `gemm_n4608_k4096` | gemm | ❌ |
| `gemm_n4096_k4096` | gemm | ✅ |
| `moe_fp8_block_scale_ds_routing_topk8_ng?_kg?_e64_h4096_i1536` | moe EP=2 | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |
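
The "effective TP=4" note reflects one common way serving engines avoid splitting KV heads below one per rank: attention tensor parallelism is capped at the number of KV heads, and the attention shards are replicated across the remaining ranks. A minimal sketch of that rule; this is an assumption about the sharding behind these names, not something the tables above state explicitly:

```python
def attn_heads_per_rank(q_heads: int, kv_heads: int, tp: int) -> tuple[int, int]:
    """Per-rank (query, KV) head counts when attention TP is capped at the KV-head count."""
    attn_tp = min(tp, kv_heads)
    return q_heads // attn_tp, kv_heads // attn_tp

# Qwen3 235B A22B: 64 query heads, 4 KV heads, served at TP=8.
print(attn_heads_per_rank(64, 4, tp=8))  # (16, 1) -> matches the h16_kv1 definitions above
```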

## Kimi K2

Architecture: 61 decoder layers, MLA attention (same structure as DeepSeek V3), sparse MoE FFN (384 total experts, top-8 routing). Standard serving configuration: TP=8, EP=4 (from sgl-cookbook). Kimi K2 uses DeepSeek V3-style MLA with the same kv_lora_rank=512 and qk_rope_head_dim=64, but has 64 attention heads (vs 128 in DeepSeek V3). With TP=8 this gives h=8, requiring separate MLA definitions from DeepSeek V3's h=16.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h7168` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h7168` | rmsnorm | ✅ |
| `rmsnorm_h1536` | rmsnorm | ✅ |
| `rmsnorm_h512` | rmsnorm | ✅ |
| `mla_paged_prefill_causal_h8_ckv512_kpe64_ps1` | mla_paged TP=8 | ❌ |
| `mla_paged_prefill_causal_h8_ckv512_kpe64_ps64` | mla_paged TP=8 | ❌ |
| `mla_paged_decode_h8_ckv512_kpe64_ps1` | mla_paged TP=8 | ❌ |
| `mla_paged_decode_h8_ckv512_kpe64_ps64` | mla_paged TP=8 | ❌ |
| `mla_ragged_prefill_causal_h8_qk192_vo128` | mla_ragged | ❌ |
| `moe_fp8_block_scale_ds_routing_topk8_ng?_kg?_e96_h7168_i2048` | moe EP=4 | ❌ |
| `top_k_sampling_from_probs_v160000` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v160000` | sampling | ❌ |
| `top_p_sampling_from_probs_v160000` | sampling | ❌ |

## Phi-4 14B

Architecture: 40 decoder layers, GQA attention (unusual 10 KV heads), dense MLP. All dimensions are new for this project.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h40_kv10_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_prefill_causal_h40_kv10_d128_ps64` | gqa_paged | ❌ |
| `gqa_paged_decode_h40_kv10_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_decode_h40_kv10_d128_ps64` | gqa_paged | ❌ |
| `gqa_ragged_prefill_causal_h40_kv10_d128` | gqa_ragged | ❌ |
| `gemm_n7680_k5120` | gemm | ❌ |
| `gemm_n5120_k5120` | gemm | 🟡 |
| `gemm_n35840_k5120` | gemm | ❌ |
| `gemm_n5120_k17920` | gemm | ❌ |
| `top_k_sampling_from_probs_v100352` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v100352` | sampling | ❌ |
| `top_p_sampling_from_probs_v100352` | sampling | ❌ |
The o_proj GEMM (`gemm_n5120_k5120`) is shared since 40 q-heads × 128 = 5120 = hidden. Missing: all GQA definitions (unusual 10 KV-head configuration), most GEMMs, and sampling v100352.

## Llama 3.1 405B

Architecture: 126 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=4 (from sgl-cookbook). Uses the same Llama architecture as Llama 3.1 8B / 3.3 70B but at significantly larger scale (hidden=16384).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h16384` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h16384` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h32_kv2_d128_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_prefill_causal_h32_kv2_d128_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h32_kv2_d128_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h32_kv2_d128_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_ragged_prefill_causal_h32_kv2_d128` | gqa_ragged TP=4 | ❌ |
| `gemm_n18432_k16384` | gemm | ❌ |
| `gemm_n16384_k16384` | gemm | ❌ |
| `gemm_n106496_k16384` | gemm | ❌ |
| `gemm_n16384_k53248` | gemm | ❌ |
| `top_k_sampling_from_probs_v128256` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v128256` | sampling | ✅ |
| `top_p_sampling_from_probs_v128256` | sampling | ✅ |

## Llama 4 Scout 17B-16E

Architecture: 48 decoder layers, interleaved GQA attention (NoPE global + RoPE local in 1:3 ratio), sparse MoE FFN (16 total experts, top-1 routing). Standard serving configuration: TP=8 (from sgl-cookbook). Multimodal (vision+text).

Note: Exact `config.json` values (`hidden_size`, `intermediate_size`) are pending verification from HuggingFace. Parameters below are estimates from the public model spec (17B activated parameters, 16 experts).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h5_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_prefill_causal_h5_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h5_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h5_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_ragged_prefill_causal_h5_kv1_d128` | gqa_ragged TP=8 | ❌ |
| MoE experts (top-1, 16 experts, standard routing) | moe | ⚠️ |
| `top_k_sampling_from_probs_v202048` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v202048` | sampling | ❌ |
| `top_p_sampling_from_probs_v202048` | sampling | ❌ |

## Llama 4 Maverick 17B-128E

Architecture: Same base architecture as Llama 4 Scout but with 128 total experts (vs 16). Standard serving configuration: TP=8 (from sgl-cookbook).

Note: Exact `config.json` values are pending verification from HuggingFace.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h5_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_prefill_causal_h5_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h5_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h5_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_ragged_prefill_causal_h5_kv1_d128` | gqa_ragged TP=8 | ❌ |
| MoE experts (top-1, 128 experts, standard routing) | moe | ⚠️ |
| `top_k_sampling_from_probs_v202048` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v202048` | sampling | ❌ |
| `top_p_sampling_from_probs_v202048` | sampling | ❌ |

## Mistral Small 3.1 24B

Architecture: 40 decoder layers, GQA attention (explicit head_dim=128), dense MLP. Standard serving configuration: TP=2 (from sgl-cookbook). Shares the same attention configuration as Mistral Nemo 12B (hidden=5120 with explicit head_dim=128 giving 32 effective query heads).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `gemm_n6144_k5120` | gemm | ❌ |
| `gemm_n5120_k4096` | gemm | ❌ |
| `gemm_n28672_k5120` | gemm | ❌ |
| `gemm_n5120_k14336` | gemm | ❌ |
| `top_k_sampling_from_probs_v131072` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v131072` | sampling | ❌ |
| `top_p_sampling_from_probs_v131072` | sampling | ❌ |

## GLM-4.6

Architecture: Dense transformer with Dual Chunk Attention (DCA), a variant of full attention with rotary embeddings. Served on Together AI and Fireworks; sgl-cookbook shows TP=8, EP=8 (high-throughput configuration), suggesting a very large MoE variant.

Note: Exact architecture parameters for GLM-4.6 require verification from the HuggingFace `config.json` (zai-org/GLM-4.6). The params below are based on the SGLang `glm4.py` defaults and may not reflect the actual model dimensions.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ (if hidden=4096) |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ (if hidden=4096) |
| GQA or custom DCA attention | attention | ⚠️ |
| MoE FFN (if applicable) | moe | ⚠️ |
| Sampling (vocab TBD) | sampling | ⚠️ |
Run `/track-models --model-name glm46 --hf-repo-id zai-org/GLM-4.6` to fetch the exact config and update this section.

## MiniMax M2 / Text-01

Architecture: Hybrid linear + softmax attention with MoE FFN. Uses a 7:1 ratio of Lightning Attention (linear) to standard Softmax Attention layers per 8-layer block, plus sparse MoE (32 experts, top-2 routing). Total parameters: ~456B with ~45.9B activated. 80 decoder layers, 64 attention heads, head_dim=128 (hidden≈8192). Lightning Attention is a novel linear attention variant that does not use the standard softmax attention mechanism. It is not currently supported by FlashInfer and requires a new op type.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h8192` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h8192` | rmsnorm | ❌ |
| Lightning Attention layers (7/8 of all layers) | lightning_attn | ❌ (op type not supported) |
| Softmax Attention layers (1/8 of all layers) | gqa_paged | ⚠️ |
| MoE experts (top-2, 32 experts) | moe | ⚠️ |
| Sampling (vocab TBD) | sampling | ⚠️ |
To track this model, a new `lightning_attn` op type would first need to be defined.
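
As a rough illustration of why Lightning Attention needs its own op type: linear attention maintains a running state that is updated once per token and read with the query, instead of recomputing softmax scores over the whole KV cache. A generic linear-attention decode step (not MiniMax's actual kernel) looks like:

```python
import torch

def linear_attention_decode_step(state, q, k, v):
    """One decode step of a generic (un-normalized) linear attention head.

    state: [d_k, d_v] running summary of all past tokens
    q, k:  [d_k]      current query / key
    v:     [d_v]      current value
    """
    state = state + torch.outer(k, v)  # fold the new token into the state
    out = q @ state                    # read out with the query; O(d_k * d_v) per step
    return state, out
```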

## Gemma 3 27B

Architecture: 62 decoder layers, GQA attention (2:1 ratio, 32 q-heads / 16 kv-heads, explicit head_dim=128 decoupled from hidden_size=5376), dense MLP with GeGLU activation.

Note: `hidden_size=5376` is non-standard; head_dim is explicitly 128 (not 5376/32=168). This is a multimodal model (vision+text) but the language backbone uses standard transformer attention.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5376` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5376` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h32_kv16_d128_ps1` | gqa_paged | 🟡 |
| `gqa_paged_prefill_causal_h32_kv16_d128_ps64` | gqa_paged | 🟡 |
| `gqa_paged_decode_h32_kv16_d128_ps1` | gqa_paged | 🟡 |
| `gqa_paged_decode_h32_kv16_d128_ps64` | gqa_paged | 🟡 |
| `gqa_ragged_prefill_causal_h32_kv16_d128` | gqa_ragged | 🟡 |
| `gemm_n4096_k5376` | gemm (q_proj) | 🟡 |
| `gemm_n2048_k5376` | gemm (k/v proj) | 🟡 |
| `gemm_n5376_k4096` | gemm (o_proj) | 🟡 |
| `gemm_n21504_k5376` | gemm (gate/up proj) | 🟡 |
| `gemm_n5376_k21504` | gemm (down proj) | 🟡 |
| `top_k_sampling_from_probs_v262208` | sampling | 🟡 |
| `top_k_top_p_sampling_from_probs_v262208` | sampling | 🟡 |
| `top_p_sampling_from_probs_v262208` | sampling | 🟡 |

## Qwen3 14B

Architecture: 40 decoder layers, GQA attention (5:1 ratio, 40 q-heads / 8 kv-heads, head_dim=128), dense MLP. Standard serving configuration: TP=2 (from sgl-cookbook), giving 20 q-heads and 4 kv-heads per device.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h20_kv4_d128_ps1` | gqa_paged TP=2 | 🟡 |
| `gqa_paged_prefill_causal_h20_kv4_d128_ps64` | gqa_paged TP=2 | 🟡 |
| `gqa_paged_decode_h20_kv4_d128_ps1` | gqa_paged TP=2 | 🟡 |
| `gqa_paged_decode_h20_kv4_d128_ps64` | gqa_paged TP=2 | 🟡 |
| `gqa_ragged_prefill_causal_h20_kv4_d128` | gqa_ragged TP=2 | 🟡 |
| `gemm_n7168_k5120` | gemm (qkv_proj combined) | 🟡 |
| `gemm_n5120_k5120` | gemm (o_proj) | 🟡 |
| `gemm_n34816_k5120` | gemm (gate_up combined) | 🟡 |
| `gemm_n5120_k17408` | gemm (down proj) | 🟡 |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |

## NemotronH 47B

Architecture: 52 decoder layers total: a hybrid of standard GQA (Transformer) and Mamba2 SSM layers. Uses 20 GQA attention layers and 32 Mamba2 layers in an interleaved pattern. Standard serving configuration: TP=8 (from sgl-cookbook). Mamba2 SSM (Structured State Space Model) is a linear recurrent architecture that does not use softmax attention; it maintains a fixed-size state matrix updated at each step, analogous to a hidden state in RNNs. Mamba2 is not currently supported as an op type in FlashInfer-Bench and requires defining a new `mamba_ssu` (Selective State-space Unit) operation type before this model can be tracked.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h{hidden}` | rmsnorm | ❌ (dims TBD) |
| GQA attention layers (20 layers, TP=8) | gqa_paged | ⚠️ |
| Mamba2 SSM layers (32 layers) | mamba_ssu | ❌ (op type not supported) |
| MLP / MoE FFN | gemm / moe | ⚠️ |
| Sampling | sampling | ⚠️ |
To track this model, the `mamba_ssu` op type schema would first need to be defined. Once that exists, the GQA attention layers could reuse existing definitions if dimensions match.
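
For comparison with the attention op types above, a Mamba2-style selective state-space update keeps a fixed-size recurrent state that is decayed and written each step, which is why it cannot be expressed with the existing attention definitions. A generic sketch (not the NemotronH implementation):

```python
import torch

def ssm_decode_step(state, x, a, B, C):
    """One decode step of a generic Mamba2-style SSM head.

    state: [d_state, d_head] recurrent state
    x:     [d_head]          current input
    a:     scalar in (0, 1)  input-dependent decay
    B, C:  [d_state]         input-dependent input/output projections
    """
    state = a * state + torch.outer(B, x)  # decay the old state, write the new token
    y = C @ state                          # read out, shape [d_head]
    return state, y
```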

