Astute followers of AI releases might be a bit confused about why we are releasing a 1B model as the last of our OLMo 2 releases. The first models dropped in November of 2024 and we let the 32B cook over the holidays when compute demand was lower, so why the heck a 1B now?
The 1B Instruct model is here and a GGUF for local users is here (all OLMo 2 models are here).
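If you want to kick the tires locally, here’s a minimal sketch using the llama-cpp-python bindings; the GGUF file name below is a placeholder for whichever quantization you download, not an official path:

```python
# Minimal local-inference sketch with llama-cpp-python.
# "OLMo-2-1B-Instruct.gguf" is a placeholder file name, not an official path.
from llama_cpp import Llama

llm = Llama(model_path="OLMo-2-1B-Instruct.gguf", n_ctx=4096)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain perplexity in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```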
The reason is that we didn’t know our 1B base model was actually good enough. If you zoom in on the coming revision to the OLMo 2 paper, you’ll see that the base model evaluations are largely “mid.” They’re decent relative to their peer models, but not as strong as the bigger models in the suite. We thought we had to keep pushing on modeling decisions for a 1B model (such as fiddling with weight decay settings) or other tricks suited to small models, i.e. older techniques that don’t scale up to bigger models and so are out of fashion. Small-model development can be handled much differently than development of bigger models.
This gap in how big vs. small models can be developed is part of the reason we suspect our final post-training results are strong compared to models like Gemma 3 1B or Llama 3.2 1B. Gemma 3 1B is the only model in its suite without a vision component and Llama 3.2 is a multimodal release; maybe these choices made their text-only performance weaker at the low end? We don’t quite know.
Here’s the shocking evaluation summary that keeps you reading!
Or the full evals for the 1B models:
You can see some things like formatting issues on DROP for Qwen or GSM8K for Gemma 3. These are the small details motivating the evaluation changes I’ll revisit later.
It turns out we had this 1B model sitting around for a while, and it was only when we tried more pretraining tricks that we compared the post-training numbers. Those numbers were far better than we expected and made the model best in class! We were sitting on great results, and a model the community could love, without knowing it was actually good.
The biggest problem here is that we don’t know which base model evaluations indicate a model will be strong after post-training. Trends seem to point toward base model evals converging with the evals used for post-training. Everything is about generating text well, and now generating chains of thought well. Qwen 3’s base model evaluations point at this:
With these evaluations, maybe the only thing to care about is perplexity over controlled chunks of text, i.e. how well next-token prediction is working.
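To make that concrete, here’s a minimal sketch of measuring perplexity on a chunk of text with the standard transformers API; the model id below is an assumed Hugging Face name for the 1B base model, so swap in whatever checkpoint you’re probing:

```python
# Perplexity over a controlled chunk of text: exp(mean next-token NLL).
# The model id below is an assumed name for illustration; any causal LM works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-0425-1B"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Language modeling is next-token prediction over large corpora."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels shifts them internally and returns the mean
    # next-token cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
```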
If OLMo base models have a hard time competing in an era of mega compute, these insights will be our most valuable contributions. I’ve been pessimistic in the past about our ability to compete with the big players, but we just put out a 1B model competitive with very recent releases and we have been sitting on it for months!
The post-training for the 1B model again proved super robust. When you have a stable recipe, it works. The RL gains were particularly reliable; we mostly just had to let training keep running.
The biggest gap we’re trying to close now in post-training is a scalable reasoning recipe. If we want to release state-of-the-art models on popular evaluations, scaling RL and inference-time compute is a requirement. We want to lead on problems like avoiding over-thinking, keeping reasoning usable, and so on, but we’ll see which innovations come first!
I’m personally feeling the big shift that all the leading AI labs have gone through in the last few months. Major changes in expectations come with major changes in tooling and processes. It’s exciting, but folks all over have been putting in serious effort to make that happen.
Let us know what you think of this 1B model. It’s been super fun to do mini research on and I suspect a lot of you will also like it for local inference tasks. What a great time to be in language modeling research.
And yes, you can make fun of us for the fact that our 1B model has 1.5B total parameters (1.3B without embedding parameters). We’ll focus on this more in the next versions — just one of those many things to get right.
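If you want to check a count like that yourself, a quick sketch (again assuming the same hypothetical model id as above) is to total the parameters and subtract the embedding matrices, being careful not to double-subtract tied weights:

```python
# Rough parameter accounting: total vs. non-embedding parameters.
# Uses the same assumed model id as above; any causal LM works.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0425-1B")

# parameters() deduplicates shared tensors, so tied embeddings count once.
total = sum(p.numel() for p in model.parameters())

emb = model.get_input_embeddings().weight
head = model.get_output_embeddings().weight
# If input and output embeddings are tied, they are one tensor; otherwise
# both matrices count as embedding parameters.
embedding = emb.numel() if emb is head else emb.numel() + head.numel()

print(f"total: {total / 1e9:.2f}B")
print(f"non-embedding: {(total - embedding) / 1e9:.2f}B")
```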
Here are the full evaluations for the OLMo 2 suite: