MiniMax M2

MiniMax M2 model. 62 decoder layers, GQA attention (48 q-heads / 8 kv-heads), sparse MoE (256 experts, top-8 sigmoid routing).