
Model Kernel Coverage

This document tracks which kernels are supported in FlashInfer-Bench for each model.
  • ✅ Definition JSON exists and workload has been collected
  • 🟑 Definition JSON exists but workload has not yet been collected
  • ❌ Definition is referenced in models.ts but the file does not exist (missing)
  • — Module exists in the architecture but no definition is mapped (unmapped)

Summary

| Model | Architecture | Coverage |
|---|---|---|
| DeepSeek V3/R1 | MLA + Dense/MoE | 🟑 Partial |
| DeepSeek V3.2 | DSA + Dense/MoE | ✅ Fully covered |
| Llama 3.1 8B | GQA + Dense | ✅ Fully covered |
| Llama 3.1/3.3 70B | GQA + Dense | 🟑 Partial |
| Llama 3.2 3B | GQA + Dense | 🟑 Partial |
| Mistral 7B v0.3 | GQA + Dense | 🟑 Partial |
| Mistral Nemo 12B | GQA + Dense | 🟑 Partial |
| Mixtral 8x7B | GQA + MoE | 🟑 Partial |
| Mixtral 8x22B | GQA + MoE | ❌ Not covered |
| Qwen2.5 7B | GQA + Dense | 🟑 Partial |
| Qwen2.5 72B | GQA + Dense | 🟑 Partial |
| Qwen3 8B | GQA + Dense | 🟑 Partial |
| Qwen3 30B A3B | GQA + MoE | 🟑 Partial |
| Qwen3 32B | GQA + Dense | 🟑 Partial |
| Qwen3 235B A22B | GQA + MoE | 🟑 Partial |
| Qwen3 Next 80B A3B | GDN + GQA + MoE | 🟑 Partial |
| Kimi K2 | MLA + MoE | 🟑 Partial |
| Phi-4 14B | GQA + Dense | 🟑 Partial |
| Llama 3.1 405B | GQA + Dense | 🟑 Partial |
| Llama 4 Scout 17B-16E | GQA + MoE | 🟑 Partial |
| Llama 4 Maverick 17B-128E | GQA + MoE | 🟑 Partial |
| Mistral Small 3.1 24B | GQA + Dense | 🟑 Partial |
| GLM-4.6 | GQA + Dense | ❌ Not covered |
| MiniMax M2 / Text-01 | Lightning Attn + MoE | ❌ Not covered |
| Gemma 3 27B | GQA + Dense | 🟑 Partial |
| Qwen3 14B | GQA + Dense | 🟑 Partial |
| NemotronH 47B | GQA + Mamba2 Hybrid | ❌ Not covered |

DeepSeek V3 / R1

Architecture: 61 decoder layers, MLA attention, hybrid Dense+MoE FFN
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h7168 | rmsnorm | ✅ |
| fused_add_rmsnorm_h7168 | rmsnorm | ✅ |
| rmsnorm_h1536 | rmsnorm | ✅ |
| rmsnorm_h512 | rmsnorm | ✅ |
| gemm_n256_k7168 | gemm | ✅ |
| mla_ragged_prefill_causal_h16_qk192_vo128 | mla_ragged | ❌ |
| mla_paged_prefill_causal_h16_ckv512_kpe64_ps1 | mla_paged | ✅ |
| mla_paged_prefill_causal_h16_ckv512_kpe64_ps64 | mla_paged | ✅ |
| mla_paged_decode_h16_ckv512_kpe64_ps1 | mla_paged | ✅ |
| mla_paged_decode_h16_ckv512_kpe64_ps64 | mla_paged | ✅ |
| moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048 | moe | ✅ |
| top_k_sampling_from_probs_v129280 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v129280 | sampling | ✅ |
| top_p_sampling_from_probs_v129280 | sampling | ✅ |
Coverage: 13 / 14 definitions present. Missing: MLA ragged prefill definition.

DeepSeek V3.2

Architecture: 61 decoder layers, DSA (DeepSeek Sparse Attention) replacing dense MLA, hybrid Dense+MoE FFN. Standard serving configuration: TP=8. DSA introduces a learned TopK indexer that selects a sparse subset of KV pages before running attention, reducing computation for long contexts while preserving accuracy.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h7168 | rmsnorm | ✅ |
| fused_add_rmsnorm_h7168 | rmsnorm | ✅ |
| rmsnorm_h1536 | rmsnorm | ✅ |
| rmsnorm_h512 | rmsnorm | ✅ |
| dsa_topk_indexer_fp8_h64_d128_topk2048_ps64 | dsa_paged | ✅ |
| dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps1 | dsa_paged | ✅ |
| dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps64 | dsa_paged | ✅ |
| moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048 | moe | ✅ |
Coverage: 8 / 8 definitions present. Fully covered.
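The two-stage flow described above (indexer selects pages, then attention runs over them) can be illustrated with a toy top-k page selection. Scoring here is a plain dot product against per-page summary keys — an assumption for illustration, not the learned FP8 indexer kernel:

```python
import numpy as np

def topk_page_select(scores: np.ndarray, topk: int) -> np.ndarray:
    """Indices of the topk highest-scoring KV pages (order not guaranteed)."""
    k = min(topk, scores.shape[-1])          # fewer pages than topk: keep all
    return np.argpartition(scores, -k, axis=-1)[..., -k:]

rng = np.random.default_rng(0)
q = rng.standard_normal(128)                 # indexer head_dim=128 per the definition name
pages = rng.standard_normal((10, 128))       # per-page summary keys (toy)
selected = topk_page_select(pages @ q, topk=4)
```

Attention would then run only over the `selected` pages instead of the full KV cache, which is where the long-context savings come from.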

Llama 3.1 8B

Architecture: 32 decoder layers, GQA attention, dense MLP
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| gemm_n28672_k4096 | gemm | ✅ |
| gemm_n4096_k14336 | gemm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 14 / 14 definitions present. Fully covered.
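The four dense-layer GEMM shapes in the table above follow directly from the model config. A sketch of that arithmetic, assuming the `gemm_n{N}_k{K}` convention used throughout this document means n = output dim, k = input dim (the helper itself is illustrative, not part of FlashInfer-Bench):

```python
def dense_gemm_defs(hidden, n_heads, n_kv_heads, head_dim, intermediate):
    return [
        f"gemm_n{(n_heads + 2 * n_kv_heads) * head_dim}_k{hidden}",  # fused qkv_proj
        f"gemm_n{hidden}_k{n_heads * head_dim}",                     # o_proj
        f"gemm_n{2 * intermediate}_k{hidden}",                       # fused gate_up_proj
        f"gemm_n{hidden}_k{intermediate}",                           # down_proj
    ]

# Llama 3.1 8B: hidden=4096, 32 q / 8 kv heads, head_dim=128, intermediate=14336
assert dense_gemm_defs(4096, 32, 8, 128, 14336) == [
    "gemm_n6144_k4096",
    "gemm_n4096_k4096",
    "gemm_n28672_k4096",
    "gemm_n4096_k14336",
]
```

The same helper reproduces the shapes listed for the other dense GQA models (e.g. Qwen3 8B's `gemm_n44032_k4096` from intermediate=22016).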

Qwen3 30B A3B

Architecture: 32 decoder layers, GQA attention, MoE FFN (30 MoE + 2 dense layers)
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h128 | rmsnorm | ✅ |
| rmsnorm_h2048 | rmsnorm | ✅ |
| fused_add_rmsnorm_h2048 | rmsnorm | ✅ |
| gemm_n128_k2048 | gemm | ✅ |
| gemm_n2048_k4096 | gemm | ✅ |
| gemm_n5120_k2048 | gemm | ✅ |
| gqa_paged_prefill_causal_h32_kv4_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv4_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv4_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv4_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv4_d128 | gqa_ragged | ✅ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
| MoE gate / topk / experts | moe | — |
Coverage: 14 / 14 referenced definitions present. MoE kernels are not yet mapped in models.ts.

Qwen3 Next 80B A3B

Architecture: 48 layers total — 36 GDN (linear attention) + 12 GQA (standard attention); all layers use MoE FFN. Standard serving configuration: TP=2 or TP=4.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h2048 | rmsnorm | ✅ |
| fused_add_rmsnorm_h2048 | rmsnorm | ✅ |
| gdn_prefill_qk16_v32_d128_k_last | gdn TP=1 | ❌ |
| gdn_prefill_qk8_v16_d128_k_last | gdn TP=2 | ✅ |
| gdn_prefill_qk4_v8_d128_k_last | gdn TP=4 | ✅ |
| gdn_decode_qk16_v32_d128_k_last | gdn TP=1 | ❌ |
| gdn_decode_qk8_v16_d128_k_last | gdn TP=2 | ✅ |
| gdn_decode_qk4_v8_d128_k_last | gdn TP=4 | ✅ |
| gdn_mtp_qk16_v32_d128_k_last | gdn TP=1 | 🟑 |
| gdn_mtp_qk8_v16_d128_k_last | gdn TP=2 | ✅ |
| gdn_mtp_qk4_v8_d128_k_last | gdn TP=4 | ✅ |
| gqa_paged_prefill_causal_h8_kv1_d256_ps1 | gqa_paged TP=2 | ❌ |
| gqa_paged_decode_h8_kv1_d256_ps1 | gqa_paged TP=2 | ❌ |
| gqa_ragged_prefill_causal_h8_kv1_d256 | gqa_ragged TP=2 | ❌ |
| MoE gate / topk / experts (GDN layers) | moe | — |
| MoE gate / topk / experts (GQA layers) | moe | — |
Coverage: 9 / 14 referenced definitions present. Missing GDN definitions: TP=1 prefill and decode (qk16_v32). Missing GQA: h=8, kv=1, d=256 (TP=2 of original h=16, kv=2, d=256).
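The TP arithmetic behind the `gdn_*` definition names above can be sketched as follows; per-device head counts are simply the full model's head counts divided by the TP degree (the helper name is illustrative, not part of FlashInfer-Bench):

```python
def gdn_def_name(stage, qk_heads, v_heads, head_dim, tp):
    """Build a gdn_* definition name for a given TP degree."""
    assert qk_heads % tp == 0 and v_heads % tp == 0, "heads must divide evenly"
    return f"gdn_{stage}_qk{qk_heads // tp}_v{v_heads // tp}_d{head_dim}_k_last"

# Qwen3 Next 80B has 16 qk / 32 v linear-attention heads at TP=1
assert gdn_def_name("prefill", 16, 32, 128, tp=2) == "gdn_prefill_qk8_v16_d128_k_last"
assert gdn_def_name("decode", 16, 32, 128, tp=4) == "gdn_decode_qk4_v8_d128_k_last"
```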

Llama 3.1 / 3.3 70B

Architecture: 80 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=4 (from sgl-cookbook). Llama 3.1 70B and 3.3 70B share identical architecture dimensions; only training data and context window differ.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h8192 | rmsnorm | ❌ |
| fused_add_rmsnorm_h8192 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h16_kv2_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_prefill_causal_h16_kv2_d128_ps64 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h16_kv2_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h16_kv2_d128_ps64 | gqa_paged TP=4 | ❌ |
| gqa_ragged_prefill_causal_h16_kv2_d128 | gqa_ragged TP=4 | ❌ |
| gemm_n10240_k8192 | gemm | ❌ |
| gemm_n8192_k8192 | gemm | ❌ |
| gemm_n57344_k8192 | gemm | ❌ |
| gemm_n8192_k28672 | gemm | ❌ |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 3 / 14 definitions present. Missing: rmsnorm h8192, all GQA definitions (h16_kv2_d128 at TP=4), all GEMM definitions for hidden=8192.

Llama 3.2 3B

Architecture: 28 decoder layers, GQA attention, dense MLP.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h3072 | rmsnorm | ❌ |
| fused_add_rmsnorm_h3072 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h24_kv8_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_prefill_causal_h24_kv8_d128_ps64 | gqa_paged | ❌ |
| gqa_paged_decode_h24_kv8_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_decode_h24_kv8_d128_ps64 | gqa_paged | ❌ |
| gqa_ragged_prefill_causal_h24_kv8_d128 | gqa_ragged | ❌ |
| gemm_n5120_k3072 | gemm | ❌ |
| gemm_n3072_k3072 | gemm | ❌ |
| gemm_n16384_k3072 | gemm | ❌ |
| gemm_n3072_k8192 | gemm | ❌ |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 3 / 14 definitions present. Missing: all rmsnorm, GQA, and GEMM definitions for hidden=3072.

Mistral 7B v0.3

Architecture: 32 decoder layers, GQA attention, dense MLP. Shares identical hidden, attention, and MLP dimensions with Llama 3.1 8B (hidden=4096, 32q/8kv heads, intermediate=14336).
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| gemm_n28672_k4096 | gemm | ✅ |
| gemm_n4096_k14336 | gemm | ✅ |
| top_k_sampling_from_probs_v32000 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v32000 | sampling | ❌ |
| top_p_sampling_from_probs_v32000 | sampling | ❌ |
Coverage: 11 / 14 definitions present. Missing: sampling definitions for vocab_size=32000.

Mistral Nemo 12B

Architecture: 40 decoder layers, GQA attention (explicit head_dim=128), dense MLP. Standard serving configuration: TP=1 (from sgl-cookbook).
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h5120 | rmsnorm | 🟑 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟑 |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k5120 | gemm | ❌ |
| gemm_n5120_k4096 | gemm | ❌ |
| gemm_n28672_k5120 | gemm | ❌ |
| gemm_n5120_k14336 | gemm | ❌ |
| top_k_sampling_from_probs_v131072 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v131072 | sampling | ❌ |
| top_p_sampling_from_probs_v131072 | sampling | ❌ |
Coverage: 7 / 14 definitions present. GQA defs are shared with Llama 3.1 8B; rmsnorm h5120 is now shared with Qwen3 14B. Missing: all four GEMM defs (hidden=5120 changes every projection shape) and sampling v131072.

Mixtral 8x7B

Architecture: 32 decoder layers, GQA attention, sparse MoE FFN (8 experts, top-2 routing). Shares attention and normalization dimensions with Llama 3.1 8B / Mistral 7B.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| MoE experts (top-2, 8 experts, inter=14336) | moe | — |
| top_k_sampling_from_probs_v32000 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v32000 | sampling | ❌ |
| top_p_sampling_from_probs_v32000 | sampling | ❌ |
Coverage: 9 / 12 referenced definitions present. MoE uses standard top-2 routing (not DeepSeek FP8 block-scale), so the existing MoE definition does not apply (unmapped). Missing: sampling v32000.

Mixtral 8x22B

Architecture: 56 decoder layers, GQA attention, sparse MoE FFN (8 experts, top-2 routing). All dimensions are new (hidden=6144, 48q/8kv heads).
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h6144 | rmsnorm | ❌ |
| fused_add_rmsnorm_h6144 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h48_kv8_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_prefill_causal_h48_kv8_d128_ps64 | gqa_paged | ❌ |
| gqa_paged_decode_h48_kv8_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_decode_h48_kv8_d128_ps64 | gqa_paged | ❌ |
| gqa_ragged_prefill_causal_h48_kv8_d128 | gqa_ragged | ❌ |
| gemm_n8192_k6144 | gemm | ❌ |
| gemm_n6144_k6144 | gemm | ❌ |
| MoE experts (top-2, 8 experts, inter=16384) | moe | — |
| top_k_sampling_from_probs_v32768 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v32768 | sampling | ❌ |
| top_p_sampling_from_probs_v32768 | sampling | ❌ |
Coverage: 0 / 12 referenced definitions present. No existing definitions match this architecture.

Qwen2.5 7B

Architecture: 28 decoder layers, GQA attention, dense MLP.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h3584 | rmsnorm | ❌ |
| fused_add_rmsnorm_h3584 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h28_kv4_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_prefill_causal_h28_kv4_d128_ps64 | gqa_paged | ❌ |
| gqa_paged_decode_h28_kv4_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_decode_h28_kv4_d128_ps64 | gqa_paged | ❌ |
| gqa_ragged_prefill_causal_h28_kv4_d128 | gqa_ragged | ❌ |
| gemm_n4608_k3584 | gemm | ❌ |
| gemm_n3584_k3584 | gemm | ❌ |
| gemm_n37888_k3584 | gemm | ❌ |
| gemm_n3584_k18944 | gemm | ❌ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 3 / 14 definitions present. Missing: all rmsnorm, GQA, and GEMM definitions for hidden=3584.

Qwen2.5 72B

Architecture: 80 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=8 (from sgl-cookbook).
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h8192 | rmsnorm | ❌ |
| fused_add_rmsnorm_h8192 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h8_kv1_d128_ps1 | gqa_paged TP=8 | ❌ |
| gqa_paged_prefill_causal_h8_kv1_d128_ps64 | gqa_paged TP=8 | ❌ |
| gqa_paged_decode_h8_kv1_d128_ps1 | gqa_paged TP=8 | ❌ |
| gqa_paged_decode_h8_kv1_d128_ps64 | gqa_paged TP=8 | ❌ |
| gqa_ragged_prefill_causal_h8_kv1_d128 | gqa_ragged TP=8 | ❌ |
| gemm_n10240_k8192 | gemm | ❌ |
| gemm_n8192_k8192 | gemm | ❌ |
| gemm_n59392_k8192 | gemm | ❌ |
| gemm_n8192_k29696 | gemm | ❌ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 3 / 14 definitions present. Missing: rmsnorm h8192, all GQA definitions (h8_kv1_d128 at TP=8), all GEMM definitions for hidden=8192.

Qwen3 8B

Architecture: 36 decoder layers, GQA attention, dense MLP. Shares hidden size and attention dimensions with Llama 3.1 8B (hidden=4096, 32q/8kv heads, head_dim=128), but uses a larger MLP intermediate size (22016 vs 14336).
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k4096 | gemm | ✅ |
| gemm_n4096_k4096 | gemm | ✅ |
| gemm_n44032_k4096 | gemm | ❌ |
| gemm_n4096_k22016 | gemm | ❌ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 12 / 14 definitions present. Missing: gate_up GEMM (gemm_n44032_k4096, intermediate=22016 × 2) and down GEMM (gemm_n4096_k22016). All normalization, attention, and non-MLP GEMM kernels are shared with Llama 3.1 8B.

Qwen3 32B

Architecture: 64 decoder layers, GQA attention, dense MLP. Uses a non-standard head_dim=64 (hidden=4096, 64 query heads). Standard serving configuration: TP=4.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h16_kv2_d64_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_prefill_causal_h16_kv2_d64_ps64 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h16_kv2_d64_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h16_kv2_d64_ps64 | gqa_paged TP=4 | ❌ |
| gqa_ragged_prefill_causal_h16_kv2_d64 | gqa_ragged TP=4 | ❌ |
| gemm_n5120_k4096 | gemm | ❌ |
| gemm_n4096_k4096 | gemm | ✅ |
| gemm_n44032_k4096 | gemm | ❌ |
| gemm_n4096_k22016 | gemm | ❌ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 6 / 14 definitions present. Missing: all GQA definitions (head_dim=64 is a new value; all existing GQA defs use d=128), QKV and MLP GEMM defs. The o_proj GEMM (gemm_n4096_k4096) is shared because 64 heads Γ— 64 head_dim = 4096 = hidden_size.

Qwen3 235B A22B

Architecture: 94 decoder layers, GQA attention, sparse MoE FFN (128 experts, top-8 routing). Uses head_dim=64 (hidden=4096, 64 query heads). Standard serving configuration: TP=8, EP=2 (FP8 variant from sgl-cookbook). With 4 KV heads, effective per-device TP for attention is TP=4 (kv=1 per device).
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h4096 | rmsnorm | ✅ |
| fused_add_rmsnorm_h4096 | rmsnorm | ✅ |
| gqa_paged_prefill_causal_h16_kv1_d64_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_prefill_causal_h16_kv1_d64_ps64 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h16_kv1_d64_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h16_kv1_d64_ps64 | gqa_paged TP=4 | ❌ |
| gqa_ragged_prefill_causal_h16_kv1_d64 | gqa_ragged TP=4 | ❌ |
| gemm_n4608_k4096 | gemm | ❌ |
| gemm_n4096_k4096 | gemm | ✅ |
| moe_fp8_block_scale_ds_routing_topk8_ng?_kg?_e64_h4096_i1536 | moe EP=2 | ❌ |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 6 / 13 referenced definitions present. Missing: all GQA defs (head_dim=64), QKV GEMM, and MoE (different hidden=4096 and intermediate=1536 vs existing h=7168, i=2048). The o_proj GEMM and rmsnorm are shared with other h=4096 models.

Kimi K2

Architecture: 61 decoder layers, MLA attention (same structure as DeepSeek V3), sparse MoE FFN (384 total experts, top-8 routing). Standard serving configuration: TP=8, EP=4 (from sgl-cookbook). Kimi K2 uses DeepSeek V3-style MLA with the same kv_lora_rank=512 and qk_rope_head_dim=64, but has 64 attention heads (vs 128 in DeepSeek V3). With TP=8 this gives h=8, requiring separate MLA definitions from DeepSeek V3’s h=16.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h7168 | rmsnorm | ✅ |
| fused_add_rmsnorm_h7168 | rmsnorm | ✅ |
| rmsnorm_h1536 | rmsnorm | ✅ |
| rmsnorm_h512 | rmsnorm | ✅ |
| mla_paged_prefill_causal_h8_ckv512_kpe64_ps1 | mla_paged TP=8 | ❌ |
| mla_paged_prefill_causal_h8_ckv512_kpe64_ps64 | mla_paged TP=8 | ❌ |
| mla_paged_decode_h8_ckv512_kpe64_ps1 | mla_paged TP=8 | ❌ |
| mla_paged_decode_h8_ckv512_kpe64_ps64 | mla_paged TP=8 | ❌ |
| mla_ragged_prefill_causal_h8_qk192_vo128 | mla_ragged | ❌ |
| moe_fp8_block_scale_ds_routing_topk8_ng?_kg?_e96_h7168_i2048 | moe EP=4 | ❌ |
| top_k_sampling_from_probs_v160000 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v160000 | sampling | ❌ |
| top_p_sampling_from_probs_v160000 | sampling | ❌ |
Coverage: 4 / 13 definitions present. RMSNorm definitions are shared with DeepSeek V3 (same hidden=7168 and sub-module dims). All MLA defs require new h=8 variants; MoE requires e=96 (384 experts / EP=4); sampling needs new v=160000 definitions.
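The sharding arithmetic in the coverage note above can be sketched as follows: attention heads shard across TP ranks, MoE experts across EP ranks (the helper is illustrative, not part of FlashInfer-Bench):

```python
def per_device(total_heads, tp, total_experts, ep):
    """Per-device head and expert counts under TP/EP sharding."""
    assert total_heads % tp == 0 and total_experts % ep == 0
    return total_heads // tp, total_experts // ep

# Kimi K2: 64 MLA heads at TP=8, 384 experts at EP=4
h, e = per_device(total_heads=64, tp=8, total_experts=384, ep=4)
assert (h, e) == (8, 96)  # hence the h8 MLA and e96 MoE definitions above
```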

Phi-4 14B

Architecture: 40 decoder layers, GQA attention (unusual 10 KV heads), dense MLP. All dimensions are new for this project.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h5120 | rmsnorm | 🟑 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟑 |
| gqa_paged_prefill_causal_h40_kv10_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_prefill_causal_h40_kv10_d128_ps64 | gqa_paged | ❌ |
| gqa_paged_decode_h40_kv10_d128_ps1 | gqa_paged | ❌ |
| gqa_paged_decode_h40_kv10_d128_ps64 | gqa_paged | ❌ |
| gqa_ragged_prefill_causal_h40_kv10_d128 | gqa_ragged | ❌ |
| gemm_n7680_k5120 | gemm | ❌ |
| gemm_n5120_k5120 | gemm | 🟑 |
| gemm_n35840_k5120 | gemm | ❌ |
| gemm_n5120_k17920 | gemm | ❌ |
| top_k_sampling_from_probs_v100352 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v100352 | sampling | ❌ |
| top_p_sampling_from_probs_v100352 | sampling | ❌ |
Coverage: 3 / 14 definitions present. rmsnorm h5120 is now shared with Qwen3 14B; gemm_n5120_k5120 (o_proj shape) is shared since 40 q-heads × 128 head_dim = 5120 = hidden_size. Missing: all GQA defs (unusual 10 KV-head config), most GEMM defs, and sampling v100352.

Llama 3.1 405B

Architecture: 126 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=4 (from sgl-cookbook). Uses the same Llama architecture as Llama 3.1 8B / 3.3 70B but at significantly larger scale (hidden=16384).
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h16384 | rmsnorm | ❌ |
| fused_add_rmsnorm_h16384 | rmsnorm | ❌ |
| gqa_paged_prefill_causal_h32_kv2_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_prefill_causal_h32_kv2_d128_ps64 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h32_kv2_d128_ps1 | gqa_paged TP=4 | ❌ |
| gqa_paged_decode_h32_kv2_d128_ps64 | gqa_paged TP=4 | ❌ |
| gqa_ragged_prefill_causal_h32_kv2_d128 | gqa_ragged TP=4 | ❌ |
| gemm_n18432_k16384 | gemm | ❌ |
| gemm_n16384_k16384 | gemm | ❌ |
| gemm_n106496_k16384 | gemm | ❌ |
| gemm_n16384_k53248 | gemm | ❌ |
| top_k_sampling_from_probs_v128256 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v128256 | sampling | ✅ |
| top_p_sampling_from_probs_v128256 | sampling | ✅ |
Coverage: 3 / 14 definitions present. Sampling definitions are shared with Llama 3.1 8B (same vocab). Missing: rmsnorm h16384 and all GQA/GEMM definitions for this scale (TP=4 gives h=128/4=32 q-heads, kv=8/4=2 — the h32_kv2 configuration does not exist in current definitions).

Llama 4 Scout 17B-16E

Architecture: 48 decoder layers, interleaved GQA attention (NoPE global + RoPE local in 1:3 ratio), sparse MoE FFN (16 total experts, top-1 routing). Standard serving configuration: TP=8 (from sgl-cookbook). Multimodal (vision+text).
Note: Exact config.json values (hidden_size, intermediate_size) are pending verification from HuggingFace. Parameters below are estimates from the public model spec (17B activated parameters, 16 experts).
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h5120 | rmsnorm | 🟑 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟑 |
| gqa_paged_prefill_causal_h5_kv1_d128_ps1 | gqa_paged TP=8 | ❌ |
| gqa_paged_prefill_causal_h5_kv1_d128_ps64 | gqa_paged TP=8 | ❌ |
| gqa_paged_decode_h5_kv1_d128_ps1 | gqa_paged TP=8 | ❌ |
| gqa_paged_decode_h5_kv1_d128_ps64 | gqa_paged TP=8 | ❌ |
| gqa_ragged_prefill_causal_h5_kv1_d128 | gqa_ragged TP=8 | ❌ |
| MoE experts (top-1, 16 experts, standard routing) | moe | — |
| top_k_sampling_from_probs_v202048 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v202048 | sampling | ❌ |
| top_p_sampling_from_probs_v202048 | sampling | ❌ |
Coverage: 2 / 10 referenced definitions present. rmsnorm h5120 is now shared with Qwen3 14B. Missing: all GQA defs (h=5 per device is unusual), MoE (standard top-1 routing not yet supported), sampling v202048.

Llama 4 Maverick 17B-128E

Architecture: Same base architecture as Llama 4 Scout but with 128 total experts (vs 16). Standard serving configuration: TP=8 (from sgl-cookbook).
Note: Exact config.json values are pending verification from HuggingFace.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h5120 | rmsnorm | 🟑 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟑 |
| gqa_paged_prefill_causal_h5_kv1_d128_ps1 | gqa_paged TP=8 | ❌ |
| gqa_paged_prefill_causal_h5_kv1_d128_ps64 | gqa_paged TP=8 | ❌ |
| gqa_paged_decode_h5_kv1_d128_ps1 | gqa_paged TP=8 | ❌ |
| gqa_paged_decode_h5_kv1_d128_ps64 | gqa_paged TP=8 | ❌ |
| gqa_ragged_prefill_causal_h5_kv1_d128 | gqa_ragged TP=8 | ❌ |
| MoE experts (top-1, 128 experts, standard routing) | moe | — |
| top_k_sampling_from_probs_v202048 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v202048 | sampling | ❌ |
| top_p_sampling_from_probs_v202048 | sampling | ❌ |
Coverage: 2 / 10 referenced definitions present. rmsnorm h5120 is now shared with Qwen3 14B. Same base dimensions as Llama 4 Scout; remaining definitions are shared once created. More experts (128 vs 16) affects MoE but not attention or normalization kernels.

Mistral Small 3.1 24B

Architecture: 40 decoder layers, GQA attention (explicit head_dim=128), dense MLP. Standard serving configuration: TP=2 (from sgl-cookbook). Shares the same attention configuration as Mistral Nemo 12B (hidden=5120 with explicit head_dim=128 giving 32 effective query heads).
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h5120 | rmsnorm | 🟑 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟑 |
| gqa_paged_prefill_causal_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_prefill_causal_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps1 | gqa_paged | ✅ |
| gqa_paged_decode_h32_kv8_d128_ps64 | gqa_paged | ✅ |
| gqa_ragged_prefill_causal_h32_kv8_d128 | gqa_ragged | ✅ |
| gemm_n6144_k5120 | gemm | ❌ |
| gemm_n5120_k4096 | gemm | ❌ |
| gemm_n28672_k5120 | gemm | ❌ |
| gemm_n5120_k14336 | gemm | ❌ |
| top_k_sampling_from_probs_v131072 | sampling | ❌ |
| top_k_top_p_sampling_from_probs_v131072 | sampling | ❌ |
| top_p_sampling_from_probs_v131072 | sampling | ❌ |
Coverage: 7 / 14 definitions present. GQA kernels are shared with Mistral Nemo 12B and Llama 3.1 8B; rmsnorm h5120 is now shared with Qwen3 14B. Missing: all four GEMM defs (hidden=5120 changes every projection shape) and sampling v131072.

GLM-4.6

Architecture: Dense transformer with Dual Chunk Attention (DCA) — a variant of full attention with rotary embeddings. Served on Together AI and Fireworks; sgl-cookbook shows TP=8, EP=8 (high-throughput configuration), suggesting a very large MoE variant.
Note: Exact architecture parameters for GLM-4.6 require verification from the HuggingFace config.json (zai-org/GLM-4.6). The params below are based on the SGLang glm4.py defaults and may not reflect the actual model dimensions.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h4096 | rmsnorm | ❌ (if hidden=4096) |
| fused_add_rmsnorm_h4096 | rmsnorm | ❌ |
| GQA or custom DCA attention | attention | — |
| MoE FFN (if applicable) | moe | — |
| Sampling (vocab TBD) | sampling | — |
Coverage: 0 / ? definitions present. Architecture requires research. DCA attention may use standard GQA kernels at the computation level (FlashInfer’s paged/ragged wrappers) or require custom handling. Run `/track-models --model-name glm46 --hf-repo-id zai-org/GLM-4.6` to fetch the exact config and update this section.

MiniMax M2 / Text-01

Architecture: Hybrid linear + softmax attention with MoE FFN. Uses a 7:1 ratio of Lightning Attention (linear) to standard Softmax Attention layers per 8-layer block, plus sparse MoE (32 experts, top-2 routing). Total parameters: ~456B with ~45.9B activated. 80 decoder layers, 64 attention heads, head_dim=128 (hiddenβ‰ˆ8192). Lightning Attention is a novel linear attention variant that does not use the standard softmax attention mechanism. It is not currently supported by FlashInfer and requires a new op type.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h8192 | rmsnorm | ❌ |
| fused_add_rmsnorm_h8192 | rmsnorm | ❌ |
| Lightning Attention layers (7/8 of all layers) | lightning_attn | ❌ (op type not supported) |
| Softmax Attention layers (1/8 of all layers) | gqa_paged | ❌ |
| MoE experts (top-2, 32 experts) | moe | — |
| Sampling (vocab TBD) | sampling | — |
Coverage: 0 / ? definitions present. The primary blocker is Lightning Attention — a linear attention variant not yet in FlashInfer. The softmax attention layers (GQA-style) also require new definitions for this model’s specific dimensions. To add support, a new lightning_attn op type would first need to be defined.
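For intuition only: generic (unnormalized) linear attention folds the entire KV history into a fixed-size state, which is why decode cost is O(1) per token instead of growing with context length. Lightning Attention is a specific, more elaborate variant with its own blocking and normalization — the toy recurrence below is not its actual kernel:

```python
import numpy as np

def linear_attention_decode_step(state, k, v, q):
    """One decode step of generic linear attention for a single head."""
    state = state + np.outer(k, v)  # accumulate k v^T into the (d, d_v) running state
    return q @ state, state         # output for this token, plus updated state

d = 4
state = np.zeros((d, d))
out, state = linear_attention_decode_step(state, np.ones(d), np.ones(d), np.ones(d))
```

The state matrix replaces the per-token KV cache entirely, so a `lightning_attn` op type would look closer to the GDN definitions above than to `gqa_paged`.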

Gemma 3 27B

Architecture: 62 decoder layers, GQA attention (2:1 ratio, 32 q-heads / 16 kv-heads, explicit head_dim=128 decoupled from hidden_size=5376), dense MLP with GeGLU activation. Note: hidden_size=5376 is non-standard; head_dim is explicitly 128 (not 5376/32=168). This is a multimodal model (vision+text) but the language backbone uses standard transformer attention.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h5376 | rmsnorm | 🟑 |
| fused_add_rmsnorm_h5376 | rmsnorm | 🟑 |
| gqa_paged_prefill_causal_h32_kv16_d128_ps1 | gqa_paged | 🟑 |
| gqa_paged_prefill_causal_h32_kv16_d128_ps64 | gqa_paged | 🟑 |
| gqa_paged_decode_h32_kv16_d128_ps1 | gqa_paged | 🟑 |
| gqa_paged_decode_h32_kv16_d128_ps64 | gqa_paged | 🟑 |
| gqa_ragged_prefill_causal_h32_kv16_d128 | gqa_ragged | 🟑 |
| gemm_n4096_k5376 | gemm (q_proj) | 🟑 |
| gemm_n2048_k5376 | gemm (k/v proj) | 🟑 |
| gemm_n5376_k4096 | gemm (o_proj) | 🟑 |
| gemm_n21504_k5376 | gemm (gate/up proj) | 🟑 |
| gemm_n5376_k21504 | gemm (down proj) | 🟑 |
| top_k_sampling_from_probs_v262208 | sampling | 🟑 |
| top_k_top_p_sampling_from_probs_v262208 | sampling | 🟑 |
| top_p_sampling_from_probs_v262208 | sampling | 🟑 |
Coverage: 15 / 15 definitions present. All dimensions are unique to this model: hidden=5376, intermediate=21504, vocab=262208. GQA ratio is 2:1 (vs 4:1 for Llama/Qwen), so kv_heads=16 (not 8). Workloads not yet collected.

Qwen3 14B

Architecture: 40 decoder layers, GQA attention (5:1 ratio, 40 q-heads / 8 kv-heads, head_dim=128), dense MLP. Standard serving configuration: TP=2 (from sgl-cookbook), giving 20 q-heads and 4 kv-heads per device.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h5120 | rmsnorm | 🟑 |
| fused_add_rmsnorm_h5120 | rmsnorm | 🟑 |
| gqa_paged_prefill_causal_h20_kv4_d128_ps1 | gqa_paged TP=2 | 🟑 |
| gqa_paged_prefill_causal_h20_kv4_d128_ps64 | gqa_paged TP=2 | 🟑 |
| gqa_paged_decode_h20_kv4_d128_ps1 | gqa_paged TP=2 | 🟑 |
| gqa_paged_decode_h20_kv4_d128_ps64 | gqa_paged TP=2 | 🟑 |
| gqa_ragged_prefill_causal_h20_kv4_d128 | gqa_ragged TP=2 | 🟑 |
| gemm_n7168_k5120 | gemm (qkv_proj combined) | 🟑 |
| gemm_n5120_k5120 | gemm (o_proj) | 🟑 |
| gemm_n34816_k5120 | gemm (gate_up combined) | 🟑 |
| gemm_n5120_k17408 | gemm (down proj) | 🟑 |
| top_k_sampling_from_probs_v151936 | sampling | ✅ |
| top_k_top_p_sampling_from_probs_v151936 | sampling | ✅ |
| top_p_sampling_from_probs_v151936 | sampling | ✅ |
Coverage: 14 / 14 definitions present. Sampling definitions are shared with Qwen2.5 and other Qwen3 models (same vocab=151936) which already have workloads collected. The rmsnorm_h5120 definition is also shared with Mistral Nemo 12B, Mistral Small 3.1 24B, Phi-4 14B, and Llama 4 Scout/Maverick. Non-sampling workloads not yet collected.

NemotronH 47B

Architecture: 52 decoder layers total — hybrid of standard GQA (Transformer) and Mamba2 SSM layers. Uses 20 GQA attention layers and 32 Mamba2 layers in an interleaved pattern. Standard serving configuration: TP=8 (from sgl-cookbook). Mamba2 SSM (Structured State Space Model) is a linear recurrent architecture that does not use softmax attention. It maintains a fixed-size state matrix updated at each step, analogous to a hidden state in RNNs. Mamba2 is not currently supported as an op type in FlashInfer-Bench and requires defining a new mamba_ssu (Selective State-space Unit) operation type before this model can be tracked.
| Definition | Op Type | Status |
|---|---|---|
| rmsnorm_h{hidden} | rmsnorm | ❌ (dims TBD) |
| GQA attention layers (20 layers, TP=8) | gqa_paged | ❌ |
| Mamba2 SSM layers (32 layers) | mamba_ssu | ❌ (op type not supported) |
| MLP / MoE FFN | gemm / moe | ❌ |
| Sampling | sampling | ❌ |
Coverage: 0 / ? definitions present. The primary blocker is the Mamba2 SSM op type — a selective state-space operation not yet defined in FlashInfer-Bench. This is analogous to MiniMax M2’s Lightning Attention blocker. To add support, a new mamba_ssu op type schema would first need to be defined. Once that exists, the GQA attention layers could reuse existing definitions if dimensions match.
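For intuition only: the fixed-size recurrent state the section describes can be sketched as one decode step of a simplified diagonal SSM. Mamba2 additionally makes the parameters input-dependent ("selective") and block-structured — this toy step is not its actual kernel:

```python
import numpy as np

def ssm_decode_step(state, A, B, C, x):
    """One recurrent step of a simplified diagonal SSM for a scalar input x."""
    state = A * state + B * x  # elementwise decay plus input injection
    return C @ state, state    # readout for this token, plus updated state

A = np.full(3, 0.5)            # per-channel decay
B = np.ones(3)                 # input projection
C = np.ones(3)                 # output projection
y, state = ssm_decode_step(np.zeros(3), A, B, C, x=2.0)
```

As with Lightning Attention, the state replaces a KV cache, so a `mamba_ssu` definition would be parameterized by state dimensions rather than head/page counts.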