# Model Kernel Coverage

This document tracks which kernels are supported in FlashInfer-Bench for each model.

- ✅ Definition JSON exists and workload has been collected
- 🟡 Definition JSON exists but workload has not yet been collected
- ❌ Definition is referenced in `models.ts` but the file does not exist (missing)
- ⚠️ Module exists in the architecture but no definition is mapped (unmapped)
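
The four states above amount to a simple check against the definition files and the collected workloads. The sketch below is illustrative only; it assumes a hypothetical layout with one `definitions/<name>.json` file per definition and a `workloads.json` index of collected workloads, which is not FlashInfer-Bench's actual file layout.

```python
import json
from pathlib import Path

# Hypothetical layout, for illustration only:
#   definitions/<name>.json  -- one JSON file per definition
#   workloads.json           -- list of definition names with collected workloads
DEFS_DIR = Path("definitions")
WORKLOADS = Path("workloads.json")
COLLECTED = set(json.loads(WORKLOADS.read_text())) if WORKLOADS.exists() else set()

def coverage_status(name: str, referenced_in_models_ts: bool) -> str:
    """Classify a definition according to the legend above."""
    if (DEFS_DIR / f"{name}.json").exists():
        return "✅" if name in COLLECTED else "🟡"
    if referenced_in_models_ts:
        return "❌"  # referenced in models.ts but the JSON file is missing
    return "⚠️"      # module exists in the architecture, no definition mapped

print(coverage_status("rmsnorm_h4096", referenced_in_models_ts=True))
```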

## Summary

| Model | Architecture | Coverage |
|---|---|---|
| DeepSeek V3/R1 | MLA + Dense/MoE | 🟡 Partial |
| DeepSeek V3.2 | DSA + Dense/MoE | ✅ Fully covered |
| Llama 3.1 8B | GQA + Dense | ✅ Fully covered |
| Llama 3.1/3.3 70B | GQA + Dense | 🟡 Partial |
| Llama 3.2 3B | GQA + Dense | 🟡 Partial |
| Mistral 7B v0.3 | GQA + Dense | 🟡 Partial |
| Mistral Nemo 12B | GQA + Dense | 🟡 Partial |
| Mixtral 8x7B | GQA + MoE | 🟡 Partial |
| Mixtral 8x22B | GQA + MoE | ❌ Not covered |
| Qwen2.5 7B | GQA + Dense | 🟡 Partial |
| Qwen2.5 72B | GQA + Dense | 🟡 Partial |
| Qwen3 8B | GQA + Dense | 🟡 Partial |
| Qwen3 30B A3B | GQA + MoE | 🟡 Partial |
| Qwen3 32B | GQA + Dense | 🟡 Partial |
| Qwen3 235B A22B | GQA + MoE | 🟡 Partial |
| Qwen3 Next 80B A3B | GDN + GQA + MoE | 🟡 Partial |
| Kimi K2 | MLA + MoE | 🟡 Partial |
| Phi-4 14B | GQA + Dense | 🟡 Partial |
| Llama 3.1 405B | GQA + Dense | 🟡 Partial |
| Llama 4 Scout 17B-16E | GQA + MoE | 🟡 Partial |
| Llama 4 Maverick 17B-128E | GQA + MoE | 🟡 Partial |
| Mistral Small 3.1 24B | GQA + Dense | 🟡 Partial |
| GLM-4.6 | GQA + Dense | ❌ Not covered |
| MiniMax M2 / Text-01 | Lightning Attn + MoE | ❌ Not covered |
| Gemma 3 27B | GQA + Dense | 🟡 Partial |
| Qwen3 14B | GQA + Dense | 🟡 Partial |
| NemotronH 47B | GQA + Mamba2 Hybrid | ❌ Not covered |

## DeepSeek V3 / R1

Architecture: 61 decoder layers, MLA attention, hybrid Dense+MoE FFN.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h7168` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h7168` | rmsnorm | ✅ |
| `rmsnorm_h1536` | rmsnorm | ✅ |
| `rmsnorm_h512` | rmsnorm | ✅ |
| `gemm_n256_k7168` | gemm | ✅ |
| `mla_ragged_prefill_causal_h16_qk192_vo128` | mla_ragged | ✅ |
| `mla_paged_prefill_causal_h16_ckv512_kpe64_ps1` | mla_paged | ✅ |
| `mla_paged_prefill_causal_h16_ckv512_kpe64_ps64` | mla_paged | ✅ |
| `mla_paged_decode_h16_ckv512_kpe64_ps1` | mla_paged | ✅ |
| `mla_paged_decode_h16_ckv512_kpe64_ps64` | mla_paged | ✅ |
| `moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048` | moe | ✅ |
| `top_k_sampling_from_probs_v129280` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v129280` | sampling | ✅ |
| `top_p_sampling_from_probs_v129280` | sampling | ✅ |

## DeepSeek V3.2

Architecture: 61 decoder layers, DSA (DeepSeek Sparse Attention) replacing dense MLA, hybrid Dense+MoE FFN. Standard serving configuration: TP=8. DSA introduces a learned TopK indexer that selects a sparse subset of KV pages before running attention, reducing computation for long contexts while preserving accuracy.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h7168` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h7168` | rmsnorm | ✅ |
| `rmsnorm_h1536` | rmsnorm | ✅ |
| `rmsnorm_h512` | rmsnorm | ✅ |
| `dsa_topk_indexer_fp8_h64_d128_topk2048_ps64` | dsa_paged | ✅ |
| `dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps1` | dsa_paged | ✅ |
| `dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps64` | dsa_paged | ✅ |
| `moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048` | moe | ✅ |
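
As a rough illustration of the two-stage flow described above (TopK indexer first, then attention over only the selected pages), here is a minimal decode-step sketch. The function, shapes, and single-tensor KV representation are illustrative only and do not correspond to the actual definition interfaces.

```python
import torch

def dsa_decode_sketch(q, kv_pages, indexer_scores, topk=2048, page_size=64):
    """Illustrative DeepSeek Sparse Attention decode step for one token.

    q:              [num_heads, head_dim]
    kv_pages:       [num_pages, page_size, head_dim] paged KV cache
                    (keys and values shown as one tensor for brevity)
    indexer_scores: [num_pages] learned relevance score per page
    """
    # Stage 1: the TopK indexer selects a sparse subset of KV pages.
    num_selected = min(topk // page_size, indexer_scores.numel())
    _, page_ids = torch.topk(indexer_scores, num_selected)

    # Stage 2: attention runs only over the selected pages.
    kv = kv_pages[page_ids].reshape(-1, kv_pages.shape[-1])  # [num_selected * page_size, head_dim]
    scores = (q @ kv.T) / (q.shape[-1] ** 0.5)               # [num_heads, num_selected * page_size]
    return torch.softmax(scores, dim=-1) @ kv                # [num_heads, head_dim]
```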

## Llama 3.1 8B

Architecture: 32 decoder layers, GQA attention, dense MLP.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gemm_n6144_k4096` | gemm | ✅ |
| `gemm_n4096_k4096` | gemm | ✅ |
| `gemm_n28672_k4096` | gemm | ✅ |
| `gemm_n4096_k14336` | gemm | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `top_k_sampling_from_probs_v128256` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v128256` | sampling | ✅ |
| `top_p_sampling_from_probs_v128256` | sampling | ✅ |

## Qwen3 30B A3B

Architecture: 32 decoder layers, GQA attention, MoE FFN (30 MoE + 2 dense layers).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h128` | rmsnorm | ❌ |
| `rmsnorm_h2048` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h2048` | rmsnorm | ❌ |
| `gemm_n128_k2048` | gemm | ❌ |
| `gemm_n2048_k4096` | gemm | ❌ |
| `gemm_n5120_k2048` | gemm | ❌ |
| `gqa_paged_prefill_causal_h32_kv4_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_prefill_causal_h32_kv4_d128_ps64` | gqa_paged | ❌ |
| `gqa_paged_decode_h32_kv4_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_decode_h32_kv4_d128_ps64` | gqa_paged | ❌ |
| `gqa_ragged_prefill_causal_h32_kv4_d128` | gqa_ragged | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |
| MoE gate / topk / experts | moe | ⚠️ |

## Qwen3 Next 80B A3B

Architecture: 48 layers total: 36 GDN (linear attention) + 12 GQA (standard attention); all layers use MoE FFN. Standard serving configuration: TP=2 or TP=4.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h2048` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h2048` | rmsnorm | ❌ |
| `gdn_prefill_qk16_v32_d128_k_last` | gdn TP=1 | 🟡 |
| `gdn_prefill_qk8_v16_d128_k_last` | gdn TP=2 | ❌ |
| `gdn_prefill_qk4_v8_d128_k_last` | gdn TP=4 | ❌ |
| `gdn_decode_qk16_v32_d128_k_last` | gdn TP=1 | 🟡 |
| `gdn_decode_qk8_v16_d128_k_last` | gdn TP=2 | ❌ |
| `gdn_decode_qk4_v8_d128_k_last` | gdn TP=4 | ❌ |
| `gdn_mtp_qk16_v32_d128_k_last` | gdn TP=1 | 🟡 |
| `gdn_mtp_qk8_v16_d128_k_last` | gdn TP=2 | ❌ |
| `gdn_mtp_qk4_v8_d128_k_last` | gdn TP=4 | ❌ |
| `gqa_paged_prefill_causal_h8_kv1_d256_ps1` | gqa_paged TP=2 | ❌ |
| `gqa_paged_decode_h8_kv1_d256_ps1` | gqa_paged TP=2 | ❌ |
| `gqa_ragged_prefill_causal_h8_kv1_d256` | gqa_ragged TP=2 | ❌ |
| MoE gate / topk / experts (GDN layers) | moe | ⚠️ |
| MoE gate / topk / experts (GQA layers) | moe | ⚠️ |
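
The three TP variants above differ only in per-device head counts: the full model's 16 qk heads and 32 v heads (the TP=1 names) are split evenly across tensor-parallel ranks. A small sketch of how the definition names scale with TP, using the naming pattern from the table; the helper itself is illustrative, not part of FlashInfer-Bench:

```python
def gdn_def_name(stage: str, tp: int, qk_heads: int = 16, v_heads: int = 32, head_dim: int = 128) -> str:
    """GDN definition name for a given TP degree (naming pattern taken from the table above)."""
    assert qk_heads % tp == 0 and v_heads % tp == 0
    return f"gdn_{stage}_qk{qk_heads // tp}_v{v_heads // tp}_d{head_dim}_k_last"

for tp in (1, 2, 4):
    print(gdn_def_name("decode", tp))
# gdn_decode_qk16_v32_d128_k_last
# gdn_decode_qk8_v16_d128_k_last
# gdn_decode_qk4_v8_d128_k_last
```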

## Llama 3.1 / 3.3 70B

Architecture: 80 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=4 (from sgl-cookbook). Llama 3.1 70B and 3.3 70B share identical architecture dimensions; only training data and context window differ.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h8192` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h8192` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h16_kv2_d128_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_prefill_causal_h16_kv2_d128_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv2_d128_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv2_d128_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_ragged_prefill_causal_h16_kv2_d128` | gqa_ragged TP=4 | ❌ |
| `gemm_n10240_k8192` | gemm | ❌ |
| `gemm_n8192_k8192` | gemm | ❌ |
| `gemm_n57344_k8192` | gemm | ❌ |
| `gemm_n8192_k28672` | gemm | ❌ |
| `top_k_sampling_from_probs_v128256` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v128256` | sampling | ✅ |
| `top_p_sampling_from_probs_v128256` | sampling | ✅ |
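
The TP=4 annotations encode per-device head counts: Llama 3.1/3.3 70B has 64 query heads and 8 KV heads in total, so each of the four tensor-parallel ranks holds 16 query heads and 2 KV heads, which is where the `h16_kv2` in the definition names comes from. A minimal sketch of that mapping (the helper is illustrative, not part of FlashInfer-Bench):

```python
def gqa_decode_def_name(q_heads: int, kv_heads: int, head_dim: int, tp: int, page_size: int) -> str:
    """Per-device GQA paged-decode definition name under tensor parallelism."""
    assert q_heads % tp == 0 and kv_heads % tp == 0
    return f"gqa_paged_decode_h{q_heads // tp}_kv{kv_heads // tp}_d{head_dim}_ps{page_size}"

# Llama 3.1/3.3 70B: 64 query heads, 8 KV heads, head_dim 128, served at TP=4.
print(gqa_decode_def_name(64, 8, 128, tp=4, page_size=1))    # gqa_paged_decode_h16_kv2_d128_ps1
print(gqa_decode_def_name(64, 8, 128, tp=4, page_size=64))   # gqa_paged_decode_h16_kv2_d128_ps64
```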

## Llama 3.2 3B

Architecture: 28 decoder layers, GQA attention, dense MLP.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h3072` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h3072` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h24_kv8_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_prefill_causal_h24_kv8_d128_ps64` | gqa_paged | ❌ |
| `gqa_paged_decode_h24_kv8_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_decode_h24_kv8_d128_ps64` | gqa_paged | ❌ |
| `gqa_ragged_prefill_causal_h24_kv8_d128` | gqa_ragged | ❌ |
| `gemm_n5120_k3072` | gemm | ❌ |
| `gemm_n3072_k3072` | gemm | ❌ |
| `gemm_n16384_k3072` | gemm | ❌ |
| `gemm_n3072_k8192` | gemm | ❌ |
| `top_k_sampling_from_probs_v128256` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v128256` | sampling | ✅ |
| `top_p_sampling_from_probs_v128256` | sampling | ✅ |

## Mistral 7B v0.3

Architecture: 32 decoder layers, GQA attention, dense MLP. Shares identical hidden, attention, and MLP dimensions with Llama 3.1 8B (hidden=4096, 32q/8kv heads, intermediate=14336).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `gemm_n6144_k4096` | gemm | ✅ |
| `gemm_n4096_k4096` | gemm | ✅ |
| `gemm_n28672_k4096` | gemm | ✅ |
| `gemm_n4096_k14336` | gemm | ✅ |
| `top_k_sampling_from_probs_v32000` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v32000` | sampling | ❌ |
| `top_p_sampling_from_probs_v32000` | sampling | ❌ |

## Mistral Nemo 12B

Architecture: 40 decoder layers, GQA attention (explicit head_dim=128), dense MLP. Standard serving configuration: TP=1 (from sgl-cookbook).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `gemm_n6144_k5120` | gemm | ❌ |
| `gemm_n5120_k4096` | gemm | ❌ |
| `gemm_n28672_k5120` | gemm | ❌ |
| `gemm_n5120_k14336` | gemm | ❌ |
| `top_k_sampling_from_probs_v131072` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v131072` | sampling | ❌ |
| `top_p_sampling_from_probs_v131072` | sampling | ❌ |

## Mixtral 8x7B

Architecture: 32 decoder layers, GQA attention, sparse MoE FFN (8 experts, top-2 routing). Shares attention and normalization dimensions with Llama 3.1 8B / Mistral 7B.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `gemm_n6144_k4096` | gemm | ✅ |
| `gemm_n4096_k4096` | gemm | ✅ |
| MoE experts (top-2, 8 experts, inter=14336) | moe | ⚠️ |
| `top_k_sampling_from_probs_v32000` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v32000` | sampling | ❌ |
| `top_p_sampling_from_probs_v32000` | sampling | ❌ |

## Mixtral 8x22B

Architecture: 56 decoder layers, GQA attention, sparse MoE FFN (8 experts, top-2 routing). All dimensions are new (hidden=6144, 48q/8kv heads).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h6144` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h6144` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h48_kv8_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_prefill_causal_h48_kv8_d128_ps64` | gqa_paged | ❌ |
| `gqa_paged_decode_h48_kv8_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_decode_h48_kv8_d128_ps64` | gqa_paged | ❌ |
| `gqa_ragged_prefill_causal_h48_kv8_d128` | gqa_ragged | ❌ |
| `gemm_n8192_k6144` | gemm | ❌ |
| `gemm_n6144_k6144` | gemm | ❌ |
| MoE experts (top-2, 8 experts, inter=16384) | moe | ⚠️ |
| `top_k_sampling_from_probs_v32768` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v32768` | sampling | ❌ |
| `top_p_sampling_from_probs_v32768` | sampling | ❌ |

## Qwen2.5 7B

Architecture: 28 decoder layers, GQA attention, dense MLP.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h3584` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h3584` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h28_kv4_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_prefill_causal_h28_kv4_d128_ps64` | gqa_paged | ❌ |
| `gqa_paged_decode_h28_kv4_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_decode_h28_kv4_d128_ps64` | gqa_paged | ❌ |
| `gqa_ragged_prefill_causal_h28_kv4_d128` | gqa_ragged | ❌ |
| `gemm_n4608_k3584` | gemm | ❌ |
| `gemm_n3584_k3584` | gemm | ❌ |
| `gemm_n37888_k3584` | gemm | ❌ |
| `gemm_n3584_k18944` | gemm | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |

## Qwen2.5 72B

Architecture: 80 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=8 (from sgl-cookbook).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h8192` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h8192` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h8_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_prefill_causal_h8_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h8_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h8_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_ragged_prefill_causal_h8_kv1_d128` | gqa_ragged TP=8 | ❌ |
| `gemm_n10240_k8192` | gemm | ❌ |
| `gemm_n8192_k8192` | gemm | ❌ |
| `gemm_n59392_k8192` | gemm | ❌ |
| `gemm_n8192_k29696` | gemm | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |

## Qwen3 8B

Architecture: 36 decoder layers, GQA attention, dense MLP. Shares hidden size and attention dimensions with Llama 3.1 8B (hidden=4096, 32q/8kv heads, head_dim=128), but uses a larger MLP intermediate size (22016 vs 14336).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `gemm_n6144_k4096` | gemm | ✅ |
| `gemm_n4096_k4096` | gemm | ✅ |
| `gemm_n44032_k4096` | gemm | ❌ |
| `gemm_n4096_k22016` | gemm | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |
Only the MLP GEMMs are new: the fused gate/up GEMM (`gemm_n44032_k4096`, intermediate=22016 × 2) and the down GEMM (`gemm_n4096_k22016`). All normalization, attention, and non-MLP GEMM kernels are shared with Llama 3.1 8B.
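
The GEMM shapes in these tables follow directly from the model config: the fused QKV projection has N = (q_heads + 2·kv_heads)·head_dim, the output projection has N = hidden and K = q_heads·head_dim, the fused gate/up projection has N = 2·intermediate, and the down projection has K = intermediate. A sketch using Qwen3 8B's dimensions; the helper name is illustrative:

```python
def dense_layer_gemm_names(hidden: int, q_heads: int, kv_heads: int, head_dim: int, intermediate: int) -> list[str]:
    """GEMM definition names for one dense decoder layer, using the naming from the tables above."""
    return [
        f"gemm_n{(q_heads + 2 * kv_heads) * head_dim}_k{hidden}",  # fused QKV projection
        f"gemm_n{hidden}_k{q_heads * head_dim}",                   # output projection
        f"gemm_n{2 * intermediate}_k{hidden}",                     # fused gate/up projection
        f"gemm_n{hidden}_k{intermediate}",                         # down projection
    ]

# Qwen3 8B: hidden 4096, 32 query / 8 KV heads, head_dim 128, intermediate 22016.
print(dense_layer_gemm_names(4096, 32, 8, 128, 22016))
# ['gemm_n6144_k4096', 'gemm_n4096_k4096', 'gemm_n44032_k4096', 'gemm_n4096_k22016']
```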

## Qwen3 32B

Architecture: 64 decoder layers, GQA attention, dense MLP. Uses a non-standard head_dim=64 (hidden=4096, 64 query heads). Standard serving configuration: TP=4.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gqa_paged_prefill_causal_h16_kv2_d64_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_prefill_causal_h16_kv2_d64_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv2_d64_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv2_d64_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_ragged_prefill_causal_h16_kv2_d64` | gqa_ragged TP=4 | ❌ |
| `gemm_n5120_k4096` | gemm | ❌ |
| `gemm_n4096_k4096` | gemm | ✅ |
| `gemm_n44032_k4096` | gemm | ❌ |
| `gemm_n4096_k22016` | gemm | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |
Only the output-projection GEMM (`gemm_n4096_k4096`) is shared, because 64 heads × 64 head_dim = 4096 = hidden_size.

## Qwen3 235B A22B

Architecture: 94 decoder layers, GQA attention, sparse MoE FFN (128 experts, top-8 routing). Uses head_dim=64 (hidden=4096, 64 query heads). Standard serving configuration: TP=8, EP=2 (FP8 variant from sgl-cookbook). With 4 KV heads, effective per-device TP for attention is TP=4 (kv=1 per device).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ |
| `gqa_paged_prefill_causal_h16_kv1_d64_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_prefill_causal_h16_kv1_d64_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv1_d64_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h16_kv1_d64_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_ragged_prefill_causal_h16_kv1_d64` | gqa_ragged TP=4 | ❌ |
| `gemm_n4608_k4096` | gemm | ❌ |
| `gemm_n4096_k4096` | gemm | ✅ |
| `moe_fp8_block_scale_ds_routing_topk8_ng?_kg?_e64_h4096_i1536` | moe EP=2 | ❌ |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |
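
The "effective TP=4" note reflects one common way serving engines avoid splitting KV heads below one per rank: attention tensor parallelism is capped at the number of KV heads, and the attention shards are replicated across the remaining ranks. A minimal sketch of that rule; this is an assumption about the sharding behind these names, not something the tables above state explicitly:

```python
def attn_heads_per_rank(q_heads: int, kv_heads: int, tp: int) -> tuple[int, int]:
    """Per-rank (query, KV) head counts when attention TP is capped at the KV-head count."""
    attn_tp = min(tp, kv_heads)
    return q_heads // attn_tp, kv_heads // attn_tp

# Qwen3 235B A22B: 64 query heads, 4 KV heads, served at TP=8.
print(attn_heads_per_rank(64, 4, tp=8))  # (16, 1) -> matches the h16_kv1 definitions above
```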

## Kimi K2

Architecture: 61 decoder layers, MLA attention (same structure as DeepSeek V3), sparse MoE FFN (384 total experts, top-8 routing). Standard serving configuration: TP=8, EP=4 (from sgl-cookbook). Kimi K2 uses DeepSeek V3-style MLA with the same kv_lora_rank=512 and qk_rope_head_dim=64, but has 64 attention heads (vs 128 in DeepSeek V3). With TP=8 this gives h=8, requiring separate MLA definitions from DeepSeek V3's h=16.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h7168` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h7168` | rmsnorm | ✅ |
| `rmsnorm_h1536` | rmsnorm | ✅ |
| `rmsnorm_h512` | rmsnorm | ✅ |
| `mla_paged_prefill_causal_h8_ckv512_kpe64_ps1` | mla_paged TP=8 | ❌ |
| `mla_paged_prefill_causal_h8_ckv512_kpe64_ps64` | mla_paged TP=8 | ❌ |
| `mla_paged_decode_h8_ckv512_kpe64_ps1` | mla_paged TP=8 | ❌ |
| `mla_paged_decode_h8_ckv512_kpe64_ps64` | mla_paged TP=8 | ❌ |
| `mla_ragged_prefill_causal_h8_qk192_vo128` | mla_ragged | ❌ |
| `moe_fp8_block_scale_ds_routing_topk8_ng?_kg?_e96_h7168_i2048` | moe EP=4 | ❌ |
| `top_k_sampling_from_probs_v160000` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v160000` | sampling | ❌ |
| `top_p_sampling_from_probs_v160000` | sampling | ❌ |

## Phi-4 14B

Architecture: 40 decoder layers, GQA attention (unusual 10 KV heads), dense MLP. All dimensions are new for this project.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h40_kv10_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_prefill_causal_h40_kv10_d128_ps64` | gqa_paged | ❌ |
| `gqa_paged_decode_h40_kv10_d128_ps1` | gqa_paged | ❌ |
| `gqa_paged_decode_h40_kv10_d128_ps64` | gqa_paged | ❌ |
| `gqa_ragged_prefill_causal_h40_kv10_d128` | gqa_ragged | ❌ |
| `gemm_n7680_k5120` | gemm | ❌ |
| `gemm_n5120_k5120` | gemm | 🟡 |
| `gemm_n35840_k5120` | gemm | ❌ |
| `gemm_n5120_k17920` | gemm | ❌ |
| `top_k_sampling_from_probs_v100352` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v100352` | sampling | ❌ |
| `top_p_sampling_from_probs_v100352` | sampling | ❌ |
The o_proj GEMM (`gemm_n5120_k5120`) is shared since 40 q-heads × 128 = 5120 = hidden. Missing: all GQA definitions (unusual 10 KV-head configuration), most GEMMs, and sampling v100352.

## Llama 3.1 405B

Architecture: 126 decoder layers, GQA attention, dense MLP. Standard serving configuration: TP=4 (from sgl-cookbook). Uses the same Llama architecture as Llama 3.1 8B / 3.3 70B but at significantly larger scale (hidden=16384).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h16384` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h16384` | rmsnorm | ❌ |
| `gqa_paged_prefill_causal_h32_kv2_d128_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_prefill_causal_h32_kv2_d128_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h32_kv2_d128_ps1` | gqa_paged TP=4 | ❌ |
| `gqa_paged_decode_h32_kv2_d128_ps64` | gqa_paged TP=4 | ❌ |
| `gqa_ragged_prefill_causal_h32_kv2_d128` | gqa_ragged TP=4 | ❌ |
| `gemm_n18432_k16384` | gemm | ❌ |
| `gemm_n16384_k16384` | gemm | ❌ |
| `gemm_n106496_k16384` | gemm | ❌ |
| `gemm_n16384_k53248` | gemm | ❌ |
| `top_k_sampling_from_probs_v128256` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v128256` | sampling | ✅ |
| `top_p_sampling_from_probs_v128256` | sampling | ✅ |

## Llama 4 Scout 17B-16E

Architecture: 48 decoder layers, interleaved GQA attention (NoPE global + RoPE local in 1:3 ratio), sparse MoE FFN (16 total experts, top-1 routing). Standard serving configuration: TP=8 (from sgl-cookbook). Multimodal (vision+text).

Note: Exact `config.json` values (`hidden_size`, `intermediate_size`) are pending verification from HuggingFace. Parameters below are estimates from the public model spec (17B activated parameters, 16 experts).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h5_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_prefill_causal_h5_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h5_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h5_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_ragged_prefill_causal_h5_kv1_d128` | gqa_ragged TP=8 | ❌ |
| MoE experts (top-1, 16 experts, standard routing) | moe | ⚠️ |
| `top_k_sampling_from_probs_v202048` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v202048` | sampling | ❌ |
| `top_p_sampling_from_probs_v202048` | sampling | ❌ |

## Llama 4 Maverick 17B-128E

Architecture: Same base architecture as Llama 4 Scout but with 128 total experts (vs 16). Standard serving configuration: TP=8 (from sgl-cookbook).

Note: Exact `config.json` values are pending verification from HuggingFace.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h5_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_prefill_causal_h5_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h5_kv1_d128_ps1` | gqa_paged TP=8 | ❌ |
| `gqa_paged_decode_h5_kv1_d128_ps64` | gqa_paged TP=8 | ❌ |
| `gqa_ragged_prefill_causal_h5_kv1_d128` | gqa_ragged TP=8 | ❌ |
| MoE experts (top-1, 128 experts, standard routing) | moe | ⚠️ |
| `top_k_sampling_from_probs_v202048` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v202048` | sampling | ❌ |
| `top_p_sampling_from_probs_v202048` | sampling | ❌ |

## Mistral Small 3.1 24B

Architecture: 40 decoder layers, GQA attention (explicit head_dim=128), dense MLP. Standard serving configuration: TP=2 (from sgl-cookbook). Shares the same attention configuration as Mistral Nemo 12B (hidden=5120 with explicit head_dim=128 giving 32 effective query heads).

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_prefill_causal_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps1` | gqa_paged | ✅ |
| `gqa_paged_decode_h32_kv8_d128_ps64` | gqa_paged | ✅ |
| `gqa_ragged_prefill_causal_h32_kv8_d128` | gqa_ragged | ✅ |
| `gemm_n6144_k5120` | gemm | ❌ |
| `gemm_n5120_k4096` | gemm | ❌ |
| `gemm_n28672_k5120` | gemm | ❌ |
| `gemm_n5120_k14336` | gemm | ❌ |
| `top_k_sampling_from_probs_v131072` | sampling | ❌ |
| `top_k_top_p_sampling_from_probs_v131072` | sampling | ❌ |
| `top_p_sampling_from_probs_v131072` | sampling | ❌ |

## GLM-4.6

Architecture: Dense transformer with Dual Chunk Attention (DCA), a variant of full attention with rotary embeddings. Served on Together AI and Fireworks; sgl-cookbook shows TP=8, EP=8 (high-throughput configuration), suggesting a very large MoE variant.

Note: Exact architecture parameters for GLM-4.6 require verification from the HuggingFace `config.json` (zai-org/GLM-4.6). The params below are based on the SGLang `glm4.py` defaults and may not reflect the actual model dimensions.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h4096` | rmsnorm | ✅ (if hidden=4096) |
| `fused_add_rmsnorm_h4096` | rmsnorm | ✅ (if hidden=4096) |
| GQA or custom DCA attention | attention | ⚠️ |
| MoE FFN (if applicable) | moe | ⚠️ |
| Sampling (vocab TBD) | sampling | ⚠️ |
Run `/track-models --model-name glm46 --hf-repo-id zai-org/GLM-4.6` to fetch the exact config and update this section.

## MiniMax M2 / Text-01

Architecture: Hybrid linear + softmax attention with MoE FFN. Uses a 7:1 ratio of Lightning Attention (linear) to standard Softmax Attention layers per 8-layer block, plus sparse MoE (32 experts, top-2 routing). Total parameters: ~456B with ~45.9B activated. 80 decoder layers, 64 attention heads, head_dim=128 (hidden≈8192). Lightning Attention is a novel linear attention variant that does not use the standard softmax attention mechanism. It is not currently supported by FlashInfer and requires a new op type.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h8192` | rmsnorm | ❌ |
| `fused_add_rmsnorm_h8192` | rmsnorm | ❌ |
| Lightning Attention layers (7/8 of all layers) | lightning_attn | ❌ (op type not supported) |
| Softmax Attention layers (1/8 of all layers) | gqa_paged | ⚠️ |
| MoE experts (top-2, 32 experts) | moe | ⚠️ |
| Sampling (vocab TBD) | sampling | ⚠️ |
To track this model, a new `lightning_attn` op type would first need to be defined.
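
As a rough illustration of why Lightning Attention needs its own op type: linear attention maintains a running state that is updated once per token and read with the query, instead of recomputing softmax scores over the whole KV cache. A generic linear-attention decode step (not MiniMax's actual kernel) looks like:

```python
import torch

def linear_attention_decode_step(state, q, k, v):
    """One decode step of a generic (un-normalized) linear attention head.

    state: [d_k, d_v] running summary of all past tokens
    q, k:  [d_k]      current query / key
    v:     [d_v]      current value
    """
    state = state + torch.outer(k, v)  # fold the new token into the state
    out = q @ state                    # read out with the query; O(d_k * d_v) per step
    return state, out
```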

## Gemma 3 27B

Architecture: 62 decoder layers, GQA attention (2:1 ratio, 32 q-heads / 16 kv-heads, explicit head_dim=128 decoupled from hidden_size=5376), dense MLP with GeGLU activation.

Note: `hidden_size=5376` is non-standard; head_dim is explicitly 128 (not 5376/32=168). This is a multimodal model (vision+text) but the language backbone uses standard transformer attention.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5376` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5376` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h32_kv16_d128_ps1` | gqa_paged | 🟡 |
| `gqa_paged_prefill_causal_h32_kv16_d128_ps64` | gqa_paged | 🟡 |
| `gqa_paged_decode_h32_kv16_d128_ps1` | gqa_paged | 🟡 |
| `gqa_paged_decode_h32_kv16_d128_ps64` | gqa_paged | 🟡 |
| `gqa_ragged_prefill_causal_h32_kv16_d128` | gqa_ragged | 🟡 |
| `gemm_n4096_k5376` | gemm (q_proj) | 🟡 |
| `gemm_n2048_k5376` | gemm (k/v proj) | 🟡 |
| `gemm_n5376_k4096` | gemm (o_proj) | 🟡 |
| `gemm_n21504_k5376` | gemm (gate/up proj) | 🟡 |
| `gemm_n5376_k21504` | gemm (down proj) | 🟡 |
| `top_k_sampling_from_probs_v262208` | sampling | 🟡 |
| `top_k_top_p_sampling_from_probs_v262208` | sampling | 🟡 |
| `top_p_sampling_from_probs_v262208` | sampling | 🟡 |

## Qwen3 14B

Architecture: 40 decoder layers, GQA attention (5:1 ratio, 40 q-heads / 8 kv-heads, head_dim=128), dense MLP. Standard serving configuration: TP=2 (from sgl-cookbook), giving 20 q-heads and 4 kv-heads per device.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h5120` | rmsnorm | 🟡 |
| `fused_add_rmsnorm_h5120` | rmsnorm | 🟡 |
| `gqa_paged_prefill_causal_h20_kv4_d128_ps1` | gqa_paged TP=2 | 🟡 |
| `gqa_paged_prefill_causal_h20_kv4_d128_ps64` | gqa_paged TP=2 | 🟡 |
| `gqa_paged_decode_h20_kv4_d128_ps1` | gqa_paged TP=2 | 🟡 |
| `gqa_paged_decode_h20_kv4_d128_ps64` | gqa_paged TP=2 | 🟡 |
| `gqa_ragged_prefill_causal_h20_kv4_d128` | gqa_ragged TP=2 | 🟡 |
| `gemm_n7168_k5120` | gemm (qkv_proj combined) | 🟡 |
| `gemm_n5120_k5120` | gemm (o_proj) | 🟡 |
| `gemm_n34816_k5120` | gemm (gate_up combined) | 🟡 |
| `gemm_n5120_k17408` | gemm (down proj) | 🟡 |
| `top_k_sampling_from_probs_v151936` | sampling | ✅ |
| `top_k_top_p_sampling_from_probs_v151936` | sampling | ✅ |
| `top_p_sampling_from_probs_v151936` | sampling | ✅ |

## NemotronH 47B

Architecture: 52 decoder layers total: a hybrid of standard GQA (Transformer) and Mamba2 SSM layers. Uses 20 GQA attention layers and 32 Mamba2 layers in an interleaved pattern. Standard serving configuration: TP=8 (from sgl-cookbook). Mamba2 SSM (Structured State Space Model) is a linear recurrent architecture that does not use softmax attention; it maintains a fixed-size state matrix updated at each step, analogous to a hidden state in RNNs. Mamba2 is not currently supported as an op type in FlashInfer-Bench and requires defining a new `mamba_ssu` (Selective State-space Unit) operation type before this model can be tracked.

| Definition | Op Type | Status |
|---|---|---|
| `rmsnorm_h{hidden}` | rmsnorm | ❌ (dims TBD) |
| GQA attention layers (20 layers, TP=8) | gqa_paged | ⚠️ |
| Mamba2 SSM layers (32 layers) | mamba_ssu | ❌ (op type not supported) |
| MLP / MoE FFN | gemm / moe | ⚠️ |
| Sampling | sampling | ⚠️ |
To track this model, the `mamba_ssu` op type schema would first need to be defined. Once that exists, the GQA attention layers could reuse existing definitions if dimensions match.
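
For comparison with the attention op types above, a Mamba2-style selective state-space update keeps a fixed-size recurrent state that is decayed and written each step, which is why it cannot be expressed with the existing attention definitions. A generic sketch (not the NemotronH implementation):

```python
import torch

def ssm_decode_step(state, x, a, B, C):
    """One decode step of a generic Mamba2-style SSM head.

    state: [d_state, d_head] recurrent state
    x:     [d_head]          current input
    a:     scalar in (0, 1)  input-dependent decay
    B, C:  [d_state]         input-dependent input/output projections
    """
    state = a * state + torch.outer(B, x)  # decay the old state, write the new token
    y = C @ state                          # read out, shape [d_head]
    return state, y
```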

