6 Comments
Ram Komarraju:

“That being said, if we had perfect verifiers — an oracle — we’d never need RLVR in the first place (or post-training really), and we could just use that instead of trying to make the model better.”

A question for clarification here: even if we had a perfect verifier, wouldn't we still need RL (post-training)? Or is this oracle good enough to verify at the token level (which doesn't seem to make sense to me)?

Thanks!

Nathan Lambert:

With a perfect verifier almost nothing matters, so don't really worry about it :)

Ram Komarraju:

ok, makes sense - too hypothetical :)

Padarn Wilson:

My understanding of your argument is that with a perfect verifier we could search over the output space (top-k sampling, for example) and use the verifier to pick? Even with a verifier you still need to generate candidates, right?

Nathan Lambert:

Yes, but the point I didn't state is that verification is extremely hard. Random sampling is a lower-bound baseline in many of these works.
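
A minimal sketch of the setup being discussed, assuming placeholder `generate` and `verifier` callables (nothing here is from the post itself): generate k candidates, then either pick one uniformly at random (the lower-bound baseline mentioned above) or let a verifier score the candidates and keep the best.

```python
import random

def best_of_k(prompt, generate, verifier, k=8):
    """Verifier-guided selection: sample k candidates and keep the highest-scoring one.

    `generate(prompt)` returns a candidate completion; `verifier(prompt, candidate)`
    returns a scalar score. Both are assumed interfaces for illustration.
    """
    candidates = [generate(prompt) for _ in range(k)]
    scores = [verifier(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

def random_sample_baseline(prompt, generate, k=8):
    """Lower-bound baseline: sample k candidates and return one uniformly at random."""
    candidates = [generate(prompt) for _ in range(k)]
    return random.choice(candidates)
```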

Torsten Nahm:

I would like to pick up on this comment. I also disagree with your point here, Nathan. With a perfect verifier you may theoretically not need anything else, but in practice and in terms of efficiency this is moot. Think P vs. NP: verifying a solution can be easy even when finding one is hard. The larger point is that getting to a solution effectively and efficiently is of huge practical importance. This paper shows that reinforcement learning helps with the latter, but it certainly also helps with the former when scaled further, e.g. with o3. I do agree with your point, though, that a big part of the power of reasoning models is their ability to generalize to settings without (easily) verifiable rewards.
