96 models tracked
| Date | Model | Developer | Parameters (total / active for MoE) | License | Key Features | Paper |
|---|---|---|---|---|---|---|
| December 29 | HyperCLOVA X SEED 8B Omni | Naver Cloud (South Korea) | 8B (dense any-to-any model) | Apache 2.0 | First native omnimodal architecture from Naver; Korean-centered any-to-any model processing and generating across text, images, and audio in unified architecture; eliminates modality barriers through integrated reasoning workflows; seamlessly handles explanations, conversations, visual analysis, and voice interactions; text-based image generation and editing functions; built on deep Korean language understanding with exceptional … | Link |
| December 29 | HyperCLOVA X SEED 32B Think | Naver Cloud (South Korea) | 32B (dense vision-language model) | Apache 2.0 | Advanced vision-language reasoning model scaling beyond SEED 14B Think; unified Transformer architecture processing text tokens and visual patches in shared embedding space; multimodal capabilities across text, images, and video with 128K context window; optional thinking mode for deep controllable reasoning; knowledge cutoff May 2025; strengthens Korean-centric reasoning and agentic capabilities … | Link |
| December 29 | Llama 3.3 8B Instruct | Meta (USA) | 8B (dense) | Llama 3.3 Community License | The "lost" Llama 3.3 8B extracted from Meta's Llama API via a finetuning workaround; the model existed behind the API since April 2025 but its weights were never officially released; extracted by downloading a finetuned model and subtracting the adapter to recover the base (see the weight-delta sketch below the table); significant improvements over Llama 3.1 8B: 81.95% IFEval (vs 78.2%), 37.0% GPQA Diamond (vs … | Link |
| December 29 | WeDLM-8B | Tencent (China) | 8B (2 variants: Base, Instruct; dense) | Apache 2.0 | First production diffusion language model with standard causal attention; initialized from Qwen3-8B; introduces Topological Reordering for parallel mask recovery under causal attention + Streaming Parallel Decoding for continuous prefix commitment; 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning (GSM8K), 2-3× faster on code generation, 1.5-2× faster on open-ended QA; outperforms … | Link |
| December 25 | MiniMax M2.1 | MiniMax (China) | 230B / 10B (MoE with 23:1 sparsity ratio) | Apache 2.0 | Enhanced successor to M2 focused on multi-language programming and real-world complex tasks; 74% SWE-bench Verified, 72.5% SWE-bench Multilingual, 88.6% VIBE benchmark (Visual & Interactive Benchmark for Execution); outperforms Claude Sonnet 4.5 and approaches Claude Opus 4.5 on multilingual coding; exceptional multi-language capabilities across Rust, Java, Go, C++, Kotlin, Objective-C, TypeScript, … | Link |
| December 25 | LFM2-2.6B-Exp | Liquid AI (USA) | 2.6B (dense) | Apache 2.0 | Experimental checkpoint built on LFM2-2.6B using pure reinforcement learning; hybrid architecture with 10 double-gated short-range convolution blocks + 6 Grouped Query Attention (GQA) blocks; specifically trained on instruction following, knowledge, and math; achieves 82.41% GSM8K, 79.56% IFEval, 42% GPQA; IFBench score surpasses DeepSeek R1-0528 (a model 263× larger); 3× faster … | Link |
| December 22 | GLM-4.7 | Zhipu AI (China) | 358B / ~32B (MoE) | MIT | Latest flagship model with major improvements in coding and creative writing; Core Coding: 73.8% SWE-bench (+5.8%), 66.7% SWE-bench Multilingual (+12.9%), 41% Terminal Bench 2.0 (+10%), 84.9% LiveCodeBench v6; Vibe Coding: cleaner/more modern webpages and better-looking slides with accurate layout/sizing; Complex Reasoning: 42.8% HLE with tools (+12.4%), 95.7% AIME 2025, 97.1% … | Link |
| December 18 | Hearthfire-24B | LatitudeGames (USA) | 24B (dense) | Apache 2.0 | Narrative longform writing model designed to embrace quiet moments and atmosphere; based on Mistral Small 3.2 Instruct; philosophy of 'vibes over velocity' prioritizing introspection and slow burn over constant action; deliberately slower-paced with cooperative and atmospheric tone (vs Wayfarer's grit and consequence); trained with SFT on single dataset of thousands … | Link |
| December 18 | FunctionGemma | Google (USA) | 270M (dense) | Gemma License | Specialized Gemma 3 270M fine-tuned for unified chat and function calling; translates natural language into structured API calls while maintaining conversational ability; achieves 85% accuracy on Mobile Actions benchmark after fine-tuning (vs 58% baseline); designed for edge deployment on mobile phones and devices like NVIDIA Jetson Nano; runs fully offline … | Link |
| December 16 | T5Gemma 2 | Google (USA) | 270M-270M, 1B-1B, 4B-4B (3 encoder-decoder sizes) | Apache 2.0 | Next generation of T5Gemma with multimodal and long-context capabilities; extends T5Gemma's adaptation recipe (UL2) from text-only to multimodal based on Gemma 3; processes text and vision inputs; introduces tied word embeddings (shares all embeddings across encoder and decoder for efficiency) and merged attention (unifies decoder self-attention and cross-attention into single … | Link |
| December 16 | Nemotron-Cascade | NVIDIA (USA) | 8B / 14B (3 variants: 8B unified, 8B-Thinking, 14B-Thinking; dense, post-trained from Qwen3) | NVIDIA Open Model License | General-purpose reasoning models trained with novel Cascade RL (sequential domain-wise reinforcement learning); 14B-Thinking outperforms DeepSeek-R1-0528 (671B) on LiveCodeBench v5/v6/Pro; achieves silver-medal performance on 2025 IOI (International Olympiad in Informatics); 8B models match DeepSeek-R1-0528 on LiveCodeBench despite being 80× smaller; beats Gemini 2.5 Pro, o4-mini, Qwen3-235B on coding benchmarks; unified 8B … | Link |
| December 16 | MiMo-V2-Flash | Xiaomi (China) | 309B / 15B (MoE with 256 experts; 8 active) | MIT | Frontier-class foundation model excelling in reasoning, coding, and agentic workflows; #1 open-source on SWE-Bench Verified (73.4%) and SWE-Bench Multilingual (71.7%); 94.1% AIME 2025 (top 2 open-source); hybrid attention architecture with 5:1 SWA:GA ratio using aggressive 128-token sliding window; 6× reduction in KV-cache vs full attention; 256K context; trained on 27T … | Link |
| December 15 | QwenLong-L1.5 | Alibaba (Qwen Team) | 30B / 3B (MoE) | Apache 2.0 | Long-context reasoning model based on Qwen3-30B-A3B-Thinking with memory management for ultra-long contexts (1M-4M tokens); three core innovations: (1) Multi-hop reasoning data synthesis pipeline that moves beyond needle-in-haystack tasks to generate complex reasoning requiring globally distributed evidence, (2) Adaptive Entropy-Controlled Policy Optimization (AEPO) algorithm for stable long-context RL training with task-balanced … | Link |
| December 15 | Nemotron 3 Nano | NVIDIA (USA) | 31.6B / 3.6B (2 variants: Base, Instruct; hybrid Mamba-Transformer MoE) | NVIDIA Open Model License | Breakthrough agentic AI model with hybrid Mamba-2 + Transformer + MoE architecture (activates 6 of 128 experts per pass); 1M-token context window natively; both Base and Instruct (post-trained) variants released; 4x faster throughput than Nemotron 2 Nano; 3.3x higher throughput than Qwen3-30B-A3B and 2.2x vs GPT-OSS-20B on … | Link |
| December 12 | OLMo 3.1 | Allen Institute for AI (USA) | 32B Think / 32B Instruct / 7B RL-Zero variants (3 model types; dense) | Apache 2.0 | Extended training of OLMo 3 with additional 21 days on 224 GPUs; Think 32B outperforms Qwen3-32B on AIME 2025 and performs close to Gemma 27B; Instruct 32B is strongest fully open 32B-scale instruct model; substantial improvements: +5 points AIME, +4 points ZebraLogic, +4 points IFEval, +20 points IFBench; beats Gemma … | Link |
| December 11 | LLaDA 2.0 | inclusionAI / Ant Group (China) | 16B / 1.4B mini and 100B / 6.1B flash (2 variants; MoE) | Apache 2.0 | First diffusion language model (dLLM) scaled to 100B parameters; uses an iterative refinement approach instead of autoregressive generation (starts from a fully masked sequence and unmasks tokens in parallel across multiple rounds; see the decoding-loop sketch below the table); 2.1x faster inference than comparable AR models (535 tokens/s); trained on ~20T tokens; excels at code generation, complex reasoning, and … | Link |
| December 9 | Nomos 1 | Nous Research (USA) | 31B (fine-tune of Qwen3-30B-A3B-Thinking-2507; MoE) | Apache 2.0 | Specialized mathematical reasoning model for problem-solving and proof-writing in natural language; developed in collaboration with Hillclimb AI; scores 87/120 on Putnam 2025 (base model only achieves 24/120 - 3.6x improvement); designed to work with Nomos Reasoning Harness (open-sourced concurrently); significant advancement in domain-specific mathematical capabilities; demonstrates power of targeted fine-tuning … | Link |
| December 9 | Devstral 2 | Mistral AI (France) | 123B Devstral 2 and 24B Small 2 (2 variants; dense) | Modified MIT (Devstral 2) / Apache 2.0 (Small 2) | Next-generation agentic coding model family; 256K context; SOTA open-weight on SWE-bench Verified (72.2%, huge jump from original Devstral's 46.8%); 7x more cost-efficient than Claude Sonnet for real-world coding tasks; business context awareness similar to Le Chat's conversational memory; ships with Mistral Vibe CLI for natural language code automation and vibe … | Link |
| December 8 | GLM-4.6V | Zhipu AI (Z.ai, China) | 106B / 12B (MoE) and 9B Flash (dense) | MIT | First GLM with native Function Calling integration; multimodal vision-language model based on GLM-4.5-Air; 128K context; SOTA on 42 public vision-language benchmarks; multimodal document understanding (processes up to 128K tokens of multi-document input as images); frontend replication with pixel-accurate HTML/CSS from UI screenshots; visual … | Link |
| December 5 | Rnj-1 | Essential AI (USA) | 8.3B (dense) | Apache 2.0 | First model from Essential AI (founded by Ashish Vaswani); exceptional code generation and agentic capabilities; leads 8B class on SWE-bench Verified (20.8%, beating Gemini 2.0 Flash and Qwen2.5-Coder 32B); SOTA tool use on BFCL; strong math (AIME) and STEM (GPQA); 32K context with YaRN extension; trained on 8.4T tokens using … | Link |
| December 3 | Hermes 4.3 | Nous Research (USA) | 36B (based on ByteDance Seed-OSS-36B-Base; dense) | Apache 2.0 | First production model trained entirely on Psyche distributed network; matches/exceeds Hermes 4 70B performance at half parameter cost; 512K context (extended from 128K); hybrid reasoning with <think> tags; SOTA on RefusalBench; trained twice (centralized vs distributed) with Psyche version outperforming; uses DisTrO optimizer for internet-scale distributed training secured by Solana … | Link |
| December 2 | Ministral 3 | Mistral AI (France) | 3B / 8B / 14B (3 sizes × 3 variants: Base, Instruct, Reasoning; dense) | Apache 2.0 | Multimodal edge-optimized family (text + vision); 128K-256K context; single GPU deployment; Base for foundation tasks, Instruct for chat/assistants, Reasoning for complex logic; 14B Reasoning achieves 85% on AIME 2025; can run on laptops/phones/drones; efficient token generation. | Link |
| December 2 | Mistral Large 3 | Mistral AI (France) | 675B / 41B (MoE) | Apache 2.0 | First open-weight frontier model with unified multimodal (text + image) and multilingual capabilities; granular MoE architecture; 256K context window; excels in long-document understanding, agentic workflows, coding, and multilingual processing; trained on 3000 H200 GPUs; ranked #2 in OSS non-reasoning on LMArena. | Link |
| December 1 | DeepSeek V3.2 | DeepSeek AI (China) | 671B / 37B (2 variants: standard, Speciale; MoE) | MIT | First DeepSeek to integrate thinking into tool-use; hybrid thinking/non-thinking modes; standard version reaches GPT-5 level (93.1% AIME, 92.5% HMMT); Speciale variant for extreme reasoning with gold medals in IMO/CMO/ICPC/IOI 2025 (99.2% HMMT, 35/42 IMO); combines theorem-proving from Math-V2; massive agent training (1,800+ environments). | Link |
| December 1 | Trinity | Arcee AI (USA) | 6B / 1B Nano (MoE) and 26B / 3B Mini (MoE) | Apache 2.0 | U.S.-trained MoE family with AFMoE architecture; 128K context; trained on 10T tokens; Nano (6B/1B) for chat with personality and on-device AI; Mini (26B/3B) for high-throughput reasoning, function calling, and agent workflows; strong on MMLU and BFCL V3. | Link |
| November 27 | DeepSeek-Math-V2 | DeepSeek AI (China) | 685B (built on DeepSeek-V3.2-Exp-Base) | Apache 2.0 | Self-verifying mathematical reasoning model with verifier-generator dual architecture; gold medal IMO 2025 (5/6 problems, 83.3%) and CMO 2024 (73.8%); near-perfect Putnam 2024 (118/120 points); IMO-ProofBench: 99% basic, 61.9% advanced; combines theorem-proving with self-verification; scales verification compute. | Link |
| November 26 | INTELLECT-3 | Prime Intellect (USA) | 106B / 12B (MoE) | MIT | Post-trained on GLM-4.5-Air-Base using SFT and RL; trained on 512 H200 GPUs with prime-rl framework; SOTA performance for size on math (90.8% AIME 2024), code, and reasoning; fully open-sourced with complete RL stack and environments. | Link |
| November 21 | Nanbeige4-3B-Thinking-2511 | Nanbeige LLM Lab / BOSS Zhipin (China) | 3B (dense) | Apache 2.0 | Small reasoning model with exceptional performance-to-size ratio; outperforms Qwen3-32B on AIME 2024 (90.4 vs 81.4) and GPQA-Diamond (82.2 vs 68.7); trained on 23T tokens with novel Fine-Grained Warmup-Stable-Decay (FG-WSD) technique; ranks #11 on WritingBench and #15 on EQBench3; scores 60 on Arena-Hard V2; SOTA open-source under 32B parameters on multiple … | Link |
| November 20 | OLMo 3 | Allen Institute for AI (USA) | 7B / 32B (multiple variants: Base, Think, Instruct, RL Zero; dense) | Apache 2.0 | Fully open model family trained on Dolma 3 (6T tokens); 65K context; Base for foundation tasks; Think for explicit reasoning (matches Qwen 3 on MATH); Instruct for chat/tool use; RL Zero for research; competitive with Qwen 2.5/Gemma 3; complete transparency from data to deployment; first fully open 32B thinking model. | Link |
| November 12 | Baguettotron | PleIAs (France) | 321M (dense) | Apache 2.0 | Small reasoning model with ultra-deep 80-layer "baguette" architecture; trained on 200B tokens of fully synthetic SYNTH dataset; native thinking traces with stenographic notation; best-in-class for size on MMLU, GSM8K, HotPotQA; multilingual (French, German, Italian, Spanish, Polish); trained on only 16 H100s; RAG-optimized with source grounding. | Link |
| November 6 | Kimi K2 Thinking | Moonshot AI (China) | 1T / 32B (MoE) | Modified MIT | Thinking agent with step-by-step reasoning and dynamic tool use; 256K context; SOTA on HLE (44.9% w/ tools) and BrowseComp (60.2%); 200-300 sequential tool calls; native INT4 quantization for 2x speed; excels at agentic coding/workflows; tops SWE-Bench Verified (71.3%). | |
| October 31 | Kimi Linear | Moonshot AI (China) | 48B / 3B (MoE) | MIT | Hybrid linear attention architecture with Kimi Delta Attention (KDA); 3:1 KDA-to-global MLA ratio; outperforms full attention across short/long-context and RL tasks; 75% KV cache reduction; 6x faster decoding at 1M context; trained on 5.7T tokens. | Link |
| October 27 | Ming Omni | Inclusion (AntLingAGI, China) | 103B / 9B Flash (MoE) and 19B Lite (dense) | MIT | Omni-modal family: Flash-preview for any-to-any (text, image gen, audio/video) with sparse MoE on Ling-Flash-2.0, high-fidelity text rendering; Lite (v1.5) lightweight full-modal for edge deployment with fast inference. | |
| October 27 | MiniMax-M2 | MiniMax AI (China) | 230B / 10B (MoE) | Open (permissive) | Compact MoE for coding/agentic workflows; multi-file edits, coding-run-fix loops, toolchains; low latency/high throughput; supports <think> format; outperforms peers on SWE-bench/Terminal-Bench. | |
| October 21 | Qwen3-VL | Alibaba (Qwen Team) | 2B / 32B (2 sizes; dense; Instruct only) | Apache 2.0 | Additional VL sizes: 2B ultra-compact for edge devices with minimal VRAM; 32B mid-large excels in long-doc/video, screenshot-to-code; same 256K→1M context and multimodal capabilities as earlier releases. | |
| October 15 | Qwen3-VL | Alibaba (Qwen Team) | 4B / 8B (2 sizes; dense; Instruct and Thinking variants) | Apache 2.0 | Vision-language family with 256K→1M context; OCR, spatial grounding (2D/3D), visual coding, GUI agents; 32-language OCR; FP8 optimized for low VRAM; Thinking variants enhance multimodal reasoning/STEM; strong in long-doc/video comprehension. | |
| October 13 | Ring-1T | Inclusion (AntLingAGI, China) | 1T (MoE) | MIT | Full release of trillion-param thinking model on Ling 2.0 arch; silver-level IMO (solved Problem 3); tops AIME '25 (92.6%), CodeForces, ARC-AGI; RLVR/IcePop tuning for stable multi-step reasoning/agents. | |
| October 9 | Ling-1T | Ant Group (Inclusion/AntLingAGI, China) | 1T (MoE) | MIT | Flagship trillion-parameter general-purpose model; hybrid Syntax–Function–Aesthetics reward for code gen; strong in math/coding; base for Ling family; pretrained on massive data for broad capabilities. | |
| October 8 | Qwen3 Omni | Alibaba (Qwen Team) | 30B (2 variants: standard, Realtime; dense) | Apache 2.0 | End-to-end omni-modal (text/image/audio/video); unified architecture with Thinker/Talker MoEs for reasoning/speech gen; 58% Big Bench Audio; 119 text langs, 19 speech in/10 out; 17 voice options; Realtime variant for low-latency speech-to-speech (0.9s first audio). | |
| September 30 | GLM-4.6 | Zhipu AI (Z.ai, China) | 355B / 32B (MoE) | MIT | Flagship upgrade to GLM-4.5; 200K context; ties Sonnet 4.5 on agentic/reasoning/coding benchmarks (tops AIME '25, LiveCodeBench v6); enhanced tool-use, search workflows, writing, translation; 30%+ token efficiency gains. | |
| September 29 | Ring-1T-preview | Inclusion (AntLingAGI, China) | 1T (MoE) | MIT | World's first open-source 1T-param reasoning model; pretrained on 20T tokens, tuned with RLVR/IcePop for stable multi-step thinking; tops AIME 2025 (92.6), CodeForces, ARC-AGI; solved IMO 2025 Problem 3 in one shot via AWorld agents; hybrid MoE from Ling 2.0 lineage. | |
| September 29 | DeepSeek-V3.2-Exp | DeepSeek AI (China) | 671B / 37B (MoE) | MIT | Experimental update to V3.1-Terminus; introduces DeepSeek Sparse Attention (DSA) for fine-grained sparse processing; major efficiency gains in long-context training/inference (e.g., adaptive expert routing, better memory); maintains near-identical quality to prior versions. | |
| September 23 | Qwen3-VL-235B-A22B | Alibaba (Qwen Team) | 235B / 22B (2 variants: Instruct, Thinking; MoE) | Apache 2.0 | Flagship vision-language model; Instruct variant outperforms Gemini 2.5 Pro on visual perception, GUI navigation, screenshot-to-code; Thinking variant SOTA on multimodal reasoning/STEM with deep causal analysis; 256K+ context for videos/PDFs; 32-lang OCR and 2D/3D spatial reasoning. | |
| September 22 | DeepSeek-V3.1-Terminus | DeepSeek AI (China) | 671B / 37B (MoE) | MIT | Update to V3.1; improved language consistency (fewer CN/EN mix-ups); enhanced Code/Search Agent performance; hybrid modes for reasoning (up to 64K tokens); stronger benchmarks in agentic tasks (e.g., SimpleQA: 96.8). | |
| September 10 | Qwen3-Next-80B-A3B | Alibaba (Qwen Team) | 80B / 3B (MoE) | Apache 2.0 | Preview of the next-generation Qwen architecture; hybrid attention (Gated DeltaNet + gated attention layers) with ultra-sparse MoE activating only 3B of 80B parameters; Instruct and Thinking variants; roughly 10x higher long-context inference throughput than Qwen3-32B while approaching the 235B flagship on reasoning benchmarks. | |
| September 5 | Kimi K2-Instruct-0905 | Moonshot AI (China) | 1T / 32B (MoE) | Apache 2.0 | Update to K2; enhanced agentic coding, front-end dev, and tool-calling; 256K context; improved integration with agents. | |
| September 3 | Nova-70B-Llama-3.3 | LatitudeGames (USA) | 70B (dense) | Llama 3.3 License | Narrative-focused 70B roleplay model trained on Llama 3.3 70B Instruct; built with same techniques as Muse-12B emphasizing relationships and character development; trained on multiple datasets combining text adventures (Wayfarer-style), long emotional narratives, detailed worldbuilding, and general roleplay; all data rewritten to eliminate common AI clichés; small single-turn instruct dataset included; … | Link |
| September 3 | Wayfarer-2-12B | LatitudeGames (USA) | 12B (dense) | Apache 2.0 | Sequel to original Wayfarer based on Mistral Nemo Base; refined formula with slower pacing and increased response length/detail; death is now possible for ALL characters (not just user); SFT training with three-ingredient recipe: Wayfarer 2 dataset, sentiment-balanced roleplay transcripts, and small instruct core to retain instructional capabilities; maintains pessimistic emotional … | Link |
| September 1 | Wayfarer-Large-70B-Llama-3.3 | LatitudeGames (USA) | 70B (dense) | Llama 3.3 License | Flagship 70B adventure roleplay model trained on Llama 3.3 70B Instruct; trained with 33/33/33 mixture of 8K text adventure data, 4K roleplay data, and SlimOrca Sonnet subset; SlimOrca instruct subset critical for emphasizing difference between instruct and fiction while amplifying Wayfarer's negative sentiment; regenerated training data from ground up to … | Link |
| August 28 | Command A Translate | Cohere Labs (Canada) | 111B (dense) | CC-BY-NC | First dedicated machine translation model from Cohere; achieves SOTA translation quality across 23 languages; introduces Deep Translation agentic workflow for iterative refinement; 16K context (8K in + 8K out); outperforms GPT-5, DeepSeek V3, DeepL Pro; enterprise-focused with private deployment options. | Link |
| August 25 | Hermes 4 | Nous Research (USA) | 14B / 70B / 405B (3 sizes; dense) | Apache 2.0 | Hybrid reasoning family (multi-step CoT + instruction-following); based on Llama 3.1; neutral alignment, uncensored; excels in math, coding, roleplay, and long-context retention; agentic function-calling; 405B flagship offers frontier-level performance with 40K+ context. | |
| August 23 | Grok-2 | xAI (USA) | 270B / 115B (MoE) | Grok 2 Community License | Open-weight release of the 2024 Grok-2 model; advanced reasoning and humor-infused responses; multimodal capabilities added in updates. | |
| August 21 | Command A Reasoning | Cohere Labs (Canada) | 111B (dense) | CC-BY-NC | First Cohere reasoning model with controllable token-budget thinking; excels at complex agentic tasks, tool use, and multilingual reasoning (23 languages); 256K context; hybrid mode (reasoning on/off); outperforms DeepSeek R1 and gpt-oss on enterprise benchmarks; powers North platform. | Link |
| August 21 | DeepSeek V3.1 | DeepSeek AI (China) | 671B / 37B (MoE) | MIT | Hybrid modes (thinking/non-thinking); pricing optimizations; improved multilingual and safety performance. | |
| August 20 | Seed-OSS-36B | ByteDance Seed Team (China) | 36B (3 variants: Base w/ synthetic, Base w/o synthetic, Instruct; dense) | Apache 2.0 | Native 512K context window (4x mainstream models); "Thinking Budget" mechanism for flexible reasoning depth control (512 to 16K tokens); trained on only 12T tokens yet achieves SOTA on multiple benchmarks; SOTA open-source on AIME24 (91.7%), LiveCodeBench v6 (67.4), RULER 128K (94.6); research-friendly dual base release (with/without … | Link |
| August 14 | Gemma 3 (270M) | Google DeepMind | 270M (dense) | Gemma License | Ultra-compact text-only model for task-specific fine-tuning; 256K-token vocabulary (~170M embedding parameters + ~100M transformer parameters); extreme energy efficiency (0.75% battery per 25 conversations on Pixel 9 Pro); strong instruction-following; QAT INT4 checkpoints; designed for on-device deployment, text classification, entity extraction; can run in browsers. | |
| August 6 | Qwen3-8B | Alibaba (Qwen Team) | 8B (dense) | Apache 2.0 | Compact dense model in Qwen3 series; suitable for on-device inference; strong in coding and multilingual support. | |
| August 5 | GPT-OSS | OpenAI (USA) | ~120B / 5.1B (MoE) and ~21B / 3.6B (MoE) | Apache 2.0 | First OpenAI open-weight release since GPT-2 (2019); both models use MoE architecture with MXFP4 quantization; 120B for reasoning and complex tasks (fits single H100); 21B for lightweight applications and on-device deployment (runs on 16GB); 128K context; adjustable reasoning effort levels; strong on code generation and structured reasoning. | Link |
| July 31 | Command A Vision | Cohere (Canada) | ~111B (est.) | Commercial | First commercial Cohere model with vision capabilities (text + image); 128K context; enterprise-focused for document analysis, chart interpretation, OCR; supports up to 20 images per request; multilingual support (English, French, German, Italian, Portuguese, Spanish). | Link |
| July 25 | GLM-4.5 | Zhipu AI (Z.ai, China) | 355B / 32B (MoE) and 106B / 12B Air (2 variants: standard, Air; MoE) | MIT | Hybrid reasoning family (thinking/non-thinking modes); standard version excels in agentic coding, tool use, and complex tasks; Air variant for efficient deployment with lower resource needs; 128K context; strong in reasoning and multilingual support. | |
| July 23 | HyperCLOVA X SEED 14B Think | Naver Cloud (South Korea) | 14B (dense) | Apache 2.0 | First open-source HyperCLOVA X reasoning model with advanced AI agent capabilities; trained at 1% cost of comparable global models (52.6× lower than Qwen2.5-14B, 91.38× lower than Qwen3-14B) through parameter pruning and knowledge distillation; multi-stage RL pipeline: SFT → RLVR (Reinforcement Learning with Verifiable Rewards) → Length Controllability → RLHF; solves … | Link |
| July 22 | Qwen3-Coder-480B-A35B-Instruct | Alibaba (Qwen Team) | 480B / 35B (MoE with 160 experts; 8 active) | Apache 2.0 | Advanced agentic coding model with 256K native context (extends to 1M); trained on 7.5T tokens (70% code); long-horizon RL with 20K parallel environments; SOTA on SWE-Bench Verified; supports 100+ languages; includes Qwen Code CLI tool. | Link |
| July 22 | Qwen3-235B-A22B-Instruct-2507 | Alibaba (Qwen Team) | 235B / 22B (MoE) | Apache 2.0 | Major instruct update to flagship; enhanced instruction-following and task-specific fine-tuning. | |
| July 19 | OpenReasoning-Nemotron | NVIDIA (USA) | 1.5B / 8B / 32B (3 sizes; dense) | Apache 2.0 | Distilled reasoning suite from DeepSeek R1-0528; SOTA in math/science/code (GPQA, MMLU-PRO, AIME 2025); tops LiveCodeBench/SciCode; supports TensorRT-LLM/NeMo integration; optimized for Hugging Face Transformers and ONNX deployment; commercially permissive. | |
| July 16 | Voxtral | Mistral AI (France) | 24B Small and 3B Mini (2 sizes; dense) | Apache 2.0 | Audio LLM family; Small (24B) transcribes 30-min audio, understands 40-min with Q&A/summarization; Mini (3B) lightweight for edge ASR tasks with automatic lang detection; multilingual ASR + LLM backbone based on Small 3.1; optimized for European languages. | |
| July 16 | Kimi K2 | Moonshot AI (China) | 1T / 32B (MoE) | Apache 2.0 | Agentic intelligence focus; state-of-the-art in creative writing and long-context tasks; open-source for experimentation. | |
| July 10 | LFM2 | Liquid AI (USA) | 350M, 700M, 1.2B, 2.6B (4 dense) + 8B-A1B MoE (8.3B total / 1.5B active) | Apache 2.0-based (free for <$10M revenue) | Hybrid architecture (10 double-gated short-range convolution blocks + 6 GQA blocks); 3× faster training than previous LFM generation; 2× faster decode/prefill on CPU vs Qwen3; edge/on-device deployment focus (smartphones, laptops, vehicles); outperforms Qwen3, Gemma 3, Phi-4-Mini in size classes; pre-trained on 10-12T tokens with 32K-context mid-training; supports creative writing, agentic … | Link |
| July 8 | T5Gemma | Google (USA) | 2B-2B, 9B-2B, 9B-9B (Gemma 2 Series) and Small/Base/Large/XL/ML (T5-compatible Series; encoder-decoder) | Apache 2.0 | First encoder-decoder models adapted from Gemma 2 via novel adaptation technique; converts pretrained decoder-only models into encoder-decoder architecture using UL2 or PrefixLM training; achieves comparable/better performance than Gemma 2 counterparts while dominating quality-efficiency frontier; T5Gemma 2B-2B IT gains +12 points MMLU and +12.7% GSM8K over Gemma 2 2B; flexible unbalanced … | Link |
| July 8 | SmolLM3-3B | Hugging Face (USA) | 3B (dense) | Apache 2.0 | Compact multilingual reasoning model with dual-mode (think/no_think); trained on 11.2T tokens; supports 128K context (6 languages); outperforms Llama 3.2 3B and Qwen2.5 3B; competitive with 4B models; GQA and NoPE architecture. | Link |
| June 20 | Mistral Small 3.2 | Mistral AI (France) | 24B (dense) | Apache 2.0 | Maintenance release focused on targeted refinements; enhanced instruction-following (84.78% accuracy vs 82.75% in v3.1); reduced infinite/repetitive generations by ~50% (1.29% vs 2.11%); improved function calling template for robust tool-use scenarios; major gains on Wildbench v2 (65.33% vs 55.6%) and Arena Hard v2 (43.1% vs 19.56%); enhanced STEM performance (HumanEval Pass@5: … | Link |
| June 16 | MiniMax-M1 | MiniMax AI (China) | 456B / 45.9B (MoE) | Open (permissive) | Hybrid-attention reasoning model; Lightning attention for efficient scaling; 1M token context; RL with CISPO; outperforms DeepSeek-R1 on SWE-bench/GPQA; function calling/agentic tools. | |
| June 10 | Magistral Small | Mistral AI (France) | 24B (dense) | Apache 2.0 | First reasoning model with CoT capabilities; transparent step-by-step thinking; multilingual expert domains; outperforms non-reasoning LLMs in accuracy. | |
| May 28 | DeepSeek R1-0528 | DeepSeek AI (China) | 671B / 37B (MoE) | MIT | Update to R1; reduced hallucinations; improved math/code benchmarks; enhanced frontend integration and agentic capabilities. | |
| May 21 | Falcon Arabic | TII (UAE) | 7B (dense) | Apache 2.0 (TII Falcon License) | First Arabic-focused Falcon model; trained on native Modern Standard Arabic and regional dialects; best-performing Arabic model in its class; matches performance of 70B models; built on Falcon 3-7B. | Link |
| May 21 | Falcon-H1 | TII (UAE) | 500M-34B (6 sizes: 500M, 1.5B, 1.5B-deep, 3B, 7B, 34B) | Apache 2.0 (TII Falcon License) | Hybrid Transformer-Mamba architecture; 256K context; multilingual (100+ languages); outperforms Llama and Qwen in its class; optimized for edge deployment; available as NVIDIA NIM microservice. | Link |
| May 21 | Devstral | Mistral AI (France) | 24B (dense) | Apache 2.0 | Agentic coding model for software engineering; tops SWE-Bench Verified (46.8%); handles multi-file repos and complex workflows; collab with All Hands AI. | |
| May 16 | Harbinger-24B | LatitudeGames (USA) | 24B (dense) | Apache 2.0 | Premium adventure roleplay model for immersive stories with real consequences; trained on Mistral Small 3.1 Instruct using two-stage approach (SFT on multi-turn Wayfarer-style text adventures and general roleplay + DPO for narrative coherence); applies same DPO techniques as Muse to reduce clichés and repetitive patterns; focuses on enhancing instruction following, … | Link |
| May 13 | Muse-12B | LatitudeGames (USA) | 12B (dense) | Apache 2.0 | Narrative-focused roleplay model emphasizing polish and coherence; uses DPO (Direct Preference Optimization) to reduce AI clichés and repetitive patterns; designed for immersive storytelling with refined outputs; less punishing than Wayfarer line while maintaining narrative quality; free to use on AI Dungeon; trained to produce more sophisticated and varied narrative responses. | Link |
| May 13 | Wayfarer-12B | LatitudeGames (USA) | 12B (dense) | Apache 2.0 | Adventure role-play model trained for challenging and dangerous text-based experiences; counters positivity bias in modern AI by embracing conflict, failure states, and character death; trained on Mistral Nemo Base using two-stage SFT (180K instruct data + 50/50 mix of synthetic 8K context text adventures and roleplay); data generated by simulating … | Link |
| April 28 | Qwen3 | Alibaba (Qwen Team) | 235B / 22B and 30B / 3B (MoE) plus 0.6B-32B (6 dense sizes) | Apache 2.0 | Flagship hybrid reasoning family (thinking/non-thinking modes); 235B MoE flagship with support for 119 languages; 30B/3B MoE and 0.6B-32B dense variants for efficient deployment; strong in coding and creative tasks. | |
| April 23 | HyperCLOVA X SEED | Naver Cloud (South Korea) | 3B (multimodal), 1.5B (text), 0.5B (text) | Apache 2.0 | First open-source HyperCLOVA X models released for commercial use under Korea's sovereign AI ecosystem initiative; SEED 3B is multimodal (text+image) designed for Korean linguistic and cultural context understanding with visual data comprehension; outperforms competing models in image/video understanding within Korean contexts; trained on high-quality Korean-centric data with years of accumulated … | Link |
| April 10 | Kimi-VL | Moonshot AI (China) | 16B / 2.8B (MoE) | MIT | Efficient multimodal vision-language model; 128K context; native-resolution MoonViT encoder for ultra-high-res images; strong on long video (64.5 LongVideoBench) and document understanding (35.1 MMLongBench-Doc); excels in OCR (83.2 InfoVQA), agent tasks (OSWorld), and multi-image reasoning; competes with GPT-4o-mini and Qwen2.5-VL-7B; includes Kimi-VL-Thinking variant with long CoT for enhanced multimodal reasoning (61.7 … | Link |
| April 8 | Llama Nemotron Ultra | NVIDIA (USA) | 253B (dense; derived from Llama 3.1 405B) | NVIDIA Open Model License | Advanced reasoning model; leads open-weights on GPQA (76%), AIME math, LiveCodeBench coding; optimized for NVIDIA hardware inference. | |
| April 5 | Llama 4 | Meta AI | 109B / 17B Scout (16 experts, MoE) and 400B / 17B Maverick (128 experts, MoE) | Llama License (open-weight) | Natively multimodal (text, image, video); Scout (109B/17B) with 10M context for accessibility and real-world AI; Maverick (400B/17B) with 1M context for high-performance tasks with scalable inference; both use early fusion architecture. | |
| March 25 | DeepSeek-V3-0324 | DeepSeek AI (China) | 671B / 37B (MoE) | MIT | Update to V3 base; major boost in reasoning and front-end development; improved multilingual benchmarks over predecessor; AIME improved from 39.6% to 59.4%, LiveCodeBench from 39.2% to 49.2%; first open-weights model to lead non-reasoning models on the Artificial Analysis Intelligence Index. | Link |
| March 18 | Llama Nemotron | NVIDIA (USA) | 8B Nano, 49B Super, 253B Ultra (3 sizes; Llama 3.1/3.3-based) | NVIDIA Open Model License | Open reasoning family built on Llama 3.1/3.3; Nano (8B) for PC/edge, Super (49B) for single GPU with best throughput, Ultra (253B) for maximum agentic accuracy; 20% improved accuracy vs base models, 5x faster inference; excels in multi-agent collaboration, workflow automation, and domain-specific fine-tuning; compute-efficient for enterprise agents. | Link |
| March 17 | Mistral Small 3.1 | Mistral AI (France) | 24B (2 variants: Base, Instruct; dense) | Apache 2.0 | Adds state-of-the-art vision understanding to Small 3; 128K context window; multimodal (text + vision); first open-source model to surpass leading proprietary models across text, vision, and multilingual capabilities in its weight class; 150 tokens/s; runs on single RTX 4090 or 32GB RAM Mac; multilingual (dozens of languages); both Base … | Link |
| March 13 | Command A | Cohere Labs (Canada) | 111B (dense) | CC-BY-NC | Enterprise-optimized model excelling at tool use, RAG, and agentic tasks; 256K context; 150% higher throughput than Command R+; competitive with GPT-4o and DeepSeek V3; requires only 2 GPUs; strong multilingual support (23 languages). | Link |
| March 12 | Gemma 3 | Google DeepMind | 1B / 4B / 12B / 27B (4 sizes; dense) | Gemma License | Multimodal (text + vision) family; optimized for lightweight to enterprise-grade deployment; includes ShieldGemma for content moderation; strong in safety alignments and instruction-following; excels in reasoning and long-context handling; 128K context; supports 140+ languages. | Link |
| March 4 | Aya Vision | Cohere Labs (Canada) | 8B and 32B (2 sizes; dense) | Research License | State-of-the-art multimodal research model (text + image); excels across multiple languages and modalities; outperforms leading open-weight models on language, text, and image benchmarks; supports 23 languages; introduces AyaVisionBench evaluation suite; research use only. | Link |
| February 26 | Phi-4 Mini / Multimodal | Microsoft (USA) | 3.8B mini and 5.6B multimodal (2 variants; dense) | MIT | Small model family: mini (3.8B) compact reasoning for math/coding on mobile devices with GQA; multimodal (5.6B) supports text, image, audio, video via Mixture of LoRAs; outperforms Gemini 2.0 Flash on audio+visual benchmarks; 200K vocab (20+ languages); top Hugging Face OpenASR leaderboard (6.14% WER); 128K context. | Link |
| January 30 | Mistral Small 3 | Mistral AI (France) | 24B (dense) | Apache 2.0 | Efficient base model for low-latency tasks; outperforms Llama 3.3 70B in internal evals; ideal for fine-tuning in automation/agent workflows; no RL/synthetic data used. | Link |
| January 20 | DeepSeek R1 | DeepSeek AI (China) | 671B / 37B (MoE) | MIT | Advanced reasoning with cold-start RL training; excels in math, code, and complex problem-solving; supports JSON output and function calling. | Link |
| January 15 | MiniMax 01 | MiniMax AI (China) | 456B / 45.9B (2 variants: Text-01, VL-01; MoE) | Open (permissive) | Foundational 01 series with Lightning Attention for linear complexity; 4M token context; 100% Needle-In-A-Haystack retrieval; Text-01 for language tasks, VL-01 for multimodal (text/image) with visual reasoning/OCR. | Link |
| January 10 | Sky-T1-32B-Preview | UC Berkeley Sky Computing Lab (USA) | 32B (dense) | MIT-style (fully open) | Open reasoning model trained for <$450 in 19 hours on 8 H100s; competitive with OpenAI o1-preview on MATH500 and AIME; trained using QwQ-32B-Preview synthetic data with rejection sampling and GPT-4o-mini reformatting; fully open with training code and data. | Link |
| January 8 | Phi-4 | Microsoft (USA) | 14B (dense) | MIT | Small language model optimized for math and coding; trained on 9.8T tokens with synthetic data; outperforms Llama 3.3 70B on MATH and GPQA despite 5x fewer parameters; decoder-only transformer with 4K context. | Link |
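
The Llama 3.3 8B Instruct entry above describes recovering unreleased base weights by downloading a finetuned checkpoint from the API and subtracting the adapter used to produce it. Below is a minimal sketch of that weight-delta idea, assuming a PEFT-style LoRA adapter; the file names, key layout, and scaling factor are illustrative assumptions, not the actual artifacts or workflow used for the extraction.

```python
# Hypothetical sketch: approximate a hidden base model by subtracting a LoRA
# adapter's contribution from the merged finetuned weights it produced.
#   base_W ~= merged_W - (alpha / r) * (B @ A)
# Paths and key names below are assumptions (PEFT-style layout), not real artifacts.
import torch
from safetensors.torch import load_file, save_file

finetuned = load_file("finetuned/model.safetensors")        # merged finetuned weights
adapter = load_file("adapter/adapter_model.safetensors")    # LoRA A/B matrices
scale = 1.0                                                 # alpha / r from the adapter config

recovered = dict(finetuned)
for name, A in adapter.items():                             # A: [r, in_features]
    if not name.endswith("lora_A.weight"):
        continue
    prefix = name[: -len("lora_A.weight")]                  # e.g. "...q_proj."
    B = adapter[prefix + "lora_B.weight"]                   # B: [out_features, r]
    target = prefix.replace("base_model.model.", "").rstrip(".") + ".weight"
    if target in recovered:
        # Undo the low-rank update: subtract B @ A (shape [out_features, in_features]).
        delta = (B.float() @ A.float()).to(recovered[target].dtype)
        recovered[target] = recovered[target] - scale * delta

save_file(recovered, "recovered_base.safetensors")
```

This only approximates the base model if the API's finetuning really is a low-rank adapter applied over frozen base weights, which is the premise of the workaround described in that entry.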
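
Several entries (WeDLM-8B, LLaDA 2.0) describe diffusion language models that decode by iterative refinement rather than left-to-right generation: start from a fully masked completion and unmask the most confident positions in parallel over a handful of rounds. The toy loop below illustrates only that decoding pattern; `model` is assumed to be a Hugging Face-style module returning `.logits`, `mask_id` is a placeholder, and real dLLM decoders add remasking schedules, block-wise decoding, and caching.

```python
# Toy confidence-based parallel unmasking loop (illustrative, not the LLaDA/WeDLM decoder).
import torch

@torch.no_grad()
def diffusion_decode(model, prompt_ids: torch.Tensor, gen_len: int,
                     rounds: int = 8, mask_id: int = 0) -> torch.Tensor:
    # Start from a fully masked "canvas" for the completion.
    canvas = torch.full((gen_len,), mask_id, dtype=torch.long, device=prompt_ids.device)
    for step in range(rounds):
        masked = canvas == mask_id
        if not masked.any():
            break
        seq = torch.cat([prompt_ids, canvas]).unsqueeze(0)     # [1, prompt_len + gen_len]
        logits = model(seq).logits[0, prompt_ids.numel():]     # [gen_len, vocab]
        conf, pred = logits.softmax(-1).max(-1)                # best guess + confidence per slot
        conf = conf.masked_fill(~masked, float("-inf"))        # never overwrite committed tokens
        remaining = int(masked.sum())
        # Commit the k most confident predictions this round; finish everything on the last round.
        k = remaining if step == rounds - 1 else max(1, min(gen_len // rounds, remaining))
        top = conf.topk(k).indices
        canvas[top] = pred[top]
    return canvas
```

Each round fills several positions at once, which is where the decoding speedups reported in those entries come from: the number of forward passes scales with the round count rather than with the sequence length.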