Moonlight - Efficient, Open-Source LLMs from Moonshot AI

by Zac Zuo

Moonlight is an open-source 3B/16B MoE LLM from Moonshot AI (3B activated of 16B total parameters), trained with the Muon optimizer for ~2x compute efficiency compared to AdamW. Pretrained, instruction-tuned, and intermediate checkpoints are all available.

Replies
Zac Zuo (Hunter)
Hi everyone! Dancing with the Moonlight? Well, it's a new family of open-source language models from Moonshot AI (the creator of Kimi.ai) that's pushing the boundaries of LLM training efficiency. The key here is the Muon optimizer, which achieves performance comparable to AdamW-trained models with only half the compute! What's interesting:

🤖 3B/16B MoE model: a Mixture-of-Experts architecture with 16B total parameters, about 3B of which are activated per token.

🚀 Muon optimizer: trained with Muon, which they claim is ~2x more sample-efficient than AdamW (see the sketch after this list).

📊 Strong performance: they report outperforming comparable models (LLaMA3-3B, Qwen2.5-3B, DeepSeek-v2-Lite) on benchmarks.

✅ Open source: not just the code, but also the pretrained, instruction-tuned, and intermediate checkpoints (there's a quick-start sketch below, too). This is fantastic for research and reproducibility.

📚 5.7T tokens: pretrained on a large-scale dataset.

The release also includes a distributed implementation of Muon that's memory-optimal and communication-efficient. Open-sourcing everything, including intermediate checkpoints, is a BIG WIN for the community. Here's to the Moon! 🚀🚀
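
For the curious, Muon's core idea (per the public Muon write-up) is to take the momentum-smoothed gradient of each 2-D weight matrix and approximately orthogonalize it with a few Newton-Schulz iterations before applying it. Here's a minimal PyTorch sketch of that update; the iteration coefficients follow the open-source Muon reference, while the function names and the bare momentum handling are just illustrative (Moonshot's variant also adds weight decay and a per-matrix update-scale adjustment, omitted here):

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.

    Coefficients follow the open-source Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)      # normalize so singular values are <= 1
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T                    # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(weight, grad, momentum_buf, lr=0.02, momentum=0.95):
    """One illustrative Muon update for a single 2-D weight matrix."""
    momentum_buf.mul_(momentum).add_(grad)   # standard momentum accumulation
    update = newton_schulz(momentum_buf)     # orthogonalize the update direction
    weight.add_(update, alpha=-lr)
    return weight, momentum_buf
```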
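
And if you just want to try it, the checkpoints are on Hugging Face. A quick-start sketch using the standard transformers chat API; the repo id below is my assumption from the release naming, so double-check the model card (which also lists the recommended generation settings):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed from the Moonshot AI release; verify on the model card.
model_id = "moonshotai/Moonlight-16B-A3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the checkpoint's native precision
    device_map="auto",       # spread the MoE across available devices
    trust_remote_code=True,  # the repo ships custom MoE modeling code
)

messages = [{"role": "user", "content": "Summarize the Muon optimizer in one line."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```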