Zhipu AI (China)
MIT
106B total / 12B active (MoE) and 9B Flash (dense). First GLM with native Function Calling integration; a multimodal vision-language model built on GLM-4.5-Air with a 128K context window, reporting SOTA results on 42 public vision-language benchmarks. Capabilities include multimodal document understanding (processes up to 128K tokens of multi-document input as images), frontend replication with pixel-accurate HTML/CSS generated from UI screenshots, visual grounding with bounding boxes, and natural-language-driven UI edits. Strong at image/video reasoning, GUI agent tasks, and complex chart and long-document analysis; bridges visual perception and executable action for real-world multimodal agents, as sketched in the example below.
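To make the function-calling and GUI-agent description concrete, here is a minimal request sketch. It assumes the zhipuai Python SDK's OpenAI-style chat interface and the model identifier "glm-4.5v", and defines a hypothetical click_element tool; none of these specifics are confirmed by the row above.

```python
# Minimal sketch: send a screenshot plus a tool definition and let the
# model decide whether to emit a function call (all names assumed).
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # placeholder credential

# Hypothetical tool the model may invoke via native function calling.
tools = [{
    "type": "function",
    "function": {
        "name": "click_element",
        "description": "Click a UI element at the given screen coordinates.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer", "description": "x pixel coordinate"},
                "y": {"type": "integer", "description": "y pixel coordinate"},
            },
            "required": ["x", "y"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.5v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            # Screenshot supplied as an image URL (OpenAI-style content parts).
            {"type": "image_url",
             "image_url": {"url": "https://example.com/screenshot.png"}},
            {"type": "text",
             "text": "Find the 'Submit' button and click it."},
        ],
    }],
    tools=tools,
)

message = response.choices[0].message
# If the model chose to call the tool, the arguments arrive as a JSON string.
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    print(message.content)
```

In a full agent loop, the returned tool call would be executed against the actual UI and its result passed back to the model as a tool message, which is the "visual perception to executable action" bridge the description refers to.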