WeDLM-8B

Tencent (China) · December 29, 2025

Parameters

8B (2 variants: Base, Instruct; dense)

License

Apache 2.0

Key Features

- First production diffusion language model to use standard causal attention; initialized directly from the pre-trained Qwen3-8B autoregressive model
- Introduces Topological Reordering for parallel mask recovery under causal attention, and Streaming Parallel Decoding for continuous prefix commitment
- 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning (GSM8K), 2-3× faster on code generation, and 1.5-2× faster on open-ended QA; all speedups measured against production-grade vLLM baselines
- Outperforms Qwen3-8B-Instruct on most benchmarks (77.53 avg vs. 75.12): 92.92% ARC-C, 92.27% GSM8K, 80.49% HumanEval, 75.14% MMLU, 44.95% GPQA
- Natively KV-cache compatible (FlashAttention, PagedAttention, CUDA Graphs), solving the diffusion-LM deployment problem where bidirectional attention breaks KV-cache compatibility
- 32,768-token context window
- Structured, low-entropy tasks (math, code) see the largest gains; conservative decoding settings preserve accuracy, while aggressive settings maximize speed
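The features above mention Streaming Parallel Decoding with "continuous prefix commitment": the model proposes tokens for several masked positions in parallel but commits only a contiguous left-to-right prefix whose confidence clears a threshold, which keeps the committed sequence compatible with a causal KV cache. The sketch below is a toy illustration of that prefix-commitment rule only; the function name, data shapes, and threshold are assumptions for illustration, not WeDLM's actual API.

```python
def commit_prefix(proposals, threshold=0.9):
    """Toy prefix-commitment rule for streaming parallel decoding.

    proposals: list of (token, confidence) pairs for masked positions,
    in left-to-right order. Commits the longest contiguous prefix whose
    confidence meets the threshold; the remainder stays masked and would
    be re-proposed on the next decoding step.

    Returns (committed_tokens, remaining_proposals).
    """
    committed = []
    for i, (token, conf) in enumerate(proposals):
        if conf < threshold:
            # First low-confidence position ends the committable prefix.
            return committed, proposals[i:]
        committed.append(token)
    return committed, []
```

A high threshold commits fewer tokens per step (the conservative setting that preserves accuracy), while a lower threshold commits more per step (the aggressive setting that maximizes speed), matching the trade-off described above.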

Paper / Source

https://github.com/Tencent/WeDLM