48B total / 3B active (MoE)
MIT
Hybrid linear attention architecture with Kimi Delta Attention (KDA); 3:1 ratio of KDA to global Multi-head Latent Attention (MLA) layers; outperforms full attention across short-context, long-context, and RL tasks; 75% KV cache reduction; 6x faster decoding at 1M context; trained on 5.7T tokens. A minimal sketch of the 3:1 interleaving is shown below.
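
A minimal sketch of the 3:1 hybrid layering described above, assuming a simple "every fourth layer is global attention" pattern. `KDALayer`, `MLALayer`, and `HybridStack` are hypothetical placeholders, not the actual Kimi implementation; standard PyTorch attention stands in for both KDA and MLA to keep the example runnable.

```python
# Sketch only: placeholder layers illustrating the 3:1 KDA-to-global ratio.
import torch
import torch.nn as nn


class KDALayer(nn.Module):
    """Placeholder for a Kimi Delta Attention (linear-attention) block.

    A real KDA layer maintains a fixed-size recurrent state rather than a
    per-token KV cache; here a linear projection stands in for it."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.proj(x)


class MLALayer(nn.Module):
    """Placeholder for a global attention block (MLA in the real model)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out


class HybridStack(nn.Module):
    """Interleaves layers at a 3:1 ratio: every fourth layer is global
    attention, the rest are KDA. Since only 1 in 4 layers keeps a growing
    KV cache, the cache shrinks by roughly 75%, matching the figure above."""
    def __init__(self, n_layers: int, d_model: int):
        super().__init__()
        self.layers = nn.ModuleList(
            MLALayer(d_model) if (i + 1) % 4 == 0 else KDALayer(d_model)
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    model = HybridStack(n_layers=8, d_model=64)
    x = torch.randn(2, 16, 64)   # (batch, seq_len, d_model)
    print(model(x).shape)        # torch.Size([2, 16, 64])
```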