flashinfer.rope.apply_rope_with_cos_sin_cache_inplace.
Variants:
- Full RoPE: rotary dimension equals head size (rotary_dim == head_size)
- Partial RoPE: rotary dimension is less than head size (rotary_dim < head_size)
- NeoX-style (is_neox=True): split the rotary dimensions into first and second halves
- GPT-J interleaved (is_neox=False): rotate even/odd indices
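
The difference between the two pairing schemes, and the pass-through behaviour of partial RoPE, can be illustrated with a small pure-Python sketch for a single head vector. This is not the FlashInfer kernel: `rotate_head` is a hypothetical helper, and the rotation formula is the standard RoPE 2D rotation, assumed here rather than taken from the text above.

```python
def rotate_head(x, cos, sin, is_neox=True):
    """Return a rotated copy of one head vector x (list of floats).

    cos and sin each hold rotary_dim // 2 values. Elements of x beyond
    rotary_dim are copied through unchanged (partial RoPE).
    """
    half = len(cos)
    rotary_dim = 2 * half
    if rotary_dim > len(x):
        raise ValueError("rotary_dim must not exceed head_size")
    out = list(x)
    for i in range(half):
        # NeoX pairs element i with i + half; GPT-J pairs 2i with 2i + 1.
        a, b = (i, i + half) if is_neox else (2 * i, 2 * i + 1)
        out[a] = x[a] * cos[i] - x[b] * sin[i]
        out[b] = x[b] * cos[i] + x[a] * sin[i]
    return out


# head_size = 6, rotary_dim = 4, with a 90-degree rotation on both pairs.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cos, sin = [0.0, 0.0], [1.0, 1.0]
neox = rotate_head(x, cos, sin, is_neox=True)   # pairs (0, 2) and (1, 3)
gptj = rotate_head(x, cos, sin, is_neox=False)  # pairs (0, 1) and (2, 3)
```

In both calls the last two elements (indices 4 and 5) lie outside the rotary dimensions and stay untouched, which is the partial RoPE case.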
Axes:
- num_tokens: variable
- num_qo_heads: variable
- num_kv_heads: variable
- max_seq_len: variable
- head_size: constant
- rotary_dim: constant

Inputs:
- q: [num_tokens, num_qo_heads, head_size]
- k: [num_tokens, num_kv_heads, head_size]
- cos_sin_cache: [max_seq_len, rotary_dim] (float32; first half cos, second half sin)
- positions: [num_tokens] (int64; indices into cos_sin_cache)
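
As a rough illustration of the cos_sin_cache layout and of how positions indexes into it, the following pure-Python sketch builds a small cache and gathers one row per token. The helper names `build_cos_sin_cache` and `lookup` are hypothetical, and the frequency schedule (base 10000, exponent -2i / rotary_dim) is the common RoPE default, assumed here because the definition only specifies the layout.

```python
import math


def build_cos_sin_cache(max_seq_len, rotary_dim, base=10000.0):
    """Build a [max_seq_len][rotary_dim] table: each row is [cos | sin]."""
    half = rotary_dim // 2
    # Assumed frequency schedule: base ** (-2i / rotary_dim).
    inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(half)]
    cache = []
    for pos in range(max_seq_len):
        angles = [pos * f for f in inv_freq]
        row = [math.cos(a) for a in angles] + [math.sin(a) for a in angles]
        cache.append(row)
    return cache


def lookup(cache, positions):
    """Gather per-token (cos, sin) halves, using positions as row indices."""
    half = len(cache[0]) // 2
    return [(cache[p][:half], cache[p][half:]) for p in positions]


cache = build_cos_sin_cache(max_seq_len=8, rotary_dim=4)
# Three tokens; the last two share position 3 and so share a cache row.
cos_sin = lookup(cache, positions=[0, 3, 3])
```

Because positions is a per-token index rather than a running counter, tokens from different sequences can be packed into the same batch and still read the correct row.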
rope_with_cos_sin_cache_neox_style_d128_rd64 vs rope_with_cos_sin_cache_gptj_style_d128_rd64.
Outputs (2 tensors, in-place):
- q_out: [num_tokens, num_qo_heads, head_size]
- k_out: [num_tokens, num_kv_heads, head_size]

