“That being said, if we had perfect verifiers — an oracle — we’d never need RLVR in the first place (or post-training really), and we could just use that instead of trying to make the model better.”
A question for clarification here: even if we have a perfect verifier, wouldn’t we still need RL (post-training)? Or is this oracle good enough to verify at the token level (which doesn’t seem to make sense to me)?
Thanks!
with a perfect verifier almost nothing matters, so don't really worry about it :)
ok, makes sense - too hypothetical :)
My understanding of your argument is that with a perfect verifier we could search over the output space (top-k sampling, for example) and use the verifier to pick the answer? Even with a verifier you still need to generate candidates, right?
Yes, but the point I didn’t state is that verification is extremely hard. Random sampling is a lower-bound baseline in many of these works.
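For concreteness, here is a minimal sketch of the search-plus-verifier idea being discussed (best-of-n / rejection sampling). The `generate` and `verify` functions are hypothetical placeholders standing in for a model call and a verifiable-reward check, not anything from the post:

```python
# Best-of-n with a verifier: sample n candidates, return the first one
# the verifier accepts. This is the "random sampling" lower-bound baseline.
import random

def generate(prompt: str) -> str:
    # Placeholder for a model call that returns one sampled completion.
    return f"candidate {random.randint(0, 9)} for: {prompt}"

def verify(prompt: str, answer: str) -> bool:
    # Placeholder for the verifier (e.g. exact-match or unit tests);
    # a hypothetical "perfect" verifier would never err here.
    return answer.startswith("candidate 7")

def best_of_n(prompt: str, n: int = 16) -> str | None:
    # Search over the output space: generation cost grows with n,
    # which is why efficiency still matters even with a good verifier.
    for _ in range(n):
        candidate = generate(prompt)
        if verify(prompt, candidate):
            return candidate
    return None  # no candidate passed within the budget

print(best_of_n("2 + 5 = ?"))
```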
I would like to pick up on this comment. I also disagree with your point here, Nathan. With a perfect verifier, you may theoretically not need anything else, but in practice, and in terms of efficiency, this is moot. Think P vs. NP. The larger point here is that getting to a solution effectively and efficiently is of huge practical importance. This paper shows that reinforcement learning helps with the latter, but it certainly also helps with the former when scaled further, e.g. with o3. I do agree with your point, though, that a big part of the power of reasoning models is their ability to generalize to settings without (easily) verifiable rewards.