14B (dense)
MIT
Small language model optimized for math and coding; trained on 9.8T tokens with a heavy emphasis on synthetic data; outperforms Llama 3.3 70B on MATH and GPQA despite having 5x fewer parameters; decoder-only transformer pretrained at 4K context.
https://arxiv.org/abs/2412.08905
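A minimal inference sketch using Hugging Face transformers; the checkpoint id (microsoft/phi-4) and chat-style usage are assumptions based on the public release, not stated in this entry.

```python
# Minimal inference sketch (assumed checkpoint id, standard transformers API).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"  # assumed Hugging Face id for the released model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # load weights in the dtype stored in the checkpoint
    device_map="auto",   # spread layers across available GPU(s)/CPU
)

# The released model is chat-tuned, so format the prompt with its chat template.
messages = [{"role": "user", "content": "Prove that the sum of two even integers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```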