16B total / 2.8B active (MoE)
MIT
Efficient multimodal vision-language model with 128K context and a native-resolution MoonViT encoder for ultra-high-resolution images; strong on long video (64.5 LongVideoBench) and long-document understanding (35.1 MMLongBench-Doc); excels at OCR (83.2 InfoVQA), agent tasks (OSWorld), and multi-image reasoning; competitive with GPT-4o-mini and Qwen2.5-VL-7B; a Kimi-VL-Thinking variant adds long chain-of-thought for enhanced multimodal reasoning (61.7 MMMU, 71.3 MathVista).