“That being said, if we had perfect verifiers — an oracle — we’d never need RLVR in the first place (or post-training really), and we could just use that instead of trying to make the model better.”
A question for clarification here: even if we have a perfect verifier, wouldn’t we still need RL (post-training)? Or is this oracle good enough to verify at the token level (which doesn’t seem to make sense to me)?
Thanks!
with a perfect verifier almost nothing matters, so don't really worry about it :)
ok, makes sense - too hypothetical :)
My understanding of your argument is that with a perfect verifier we could search over the output space (top-k sampling, for example) and use the verifier to pick the answer? Even with a verifier you still need to generate candidates, right?
Yes, but the point I didn’t state is that verification is extremely hard. Random sampling is a lower-bound baseline in many of these works.
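For concreteness, here is a minimal sketch of the search-plus-verifier idea being discussed (best-of-n / rejection sampling). The `generate` and `verify` functions are hypothetical placeholders standing in for a model call and a verifiable-reward check, not anything from the post:

```python
# Best-of-n with a verifier: sample n candidates, return the first one
# the verifier accepts. This is the "random sampling" lower-bound baseline.
import random

def generate(prompt: str) -> str:
    # Placeholder for a model call that returns one sampled completion.
    return f"candidate {random.randint(0, 9)} for: {prompt}"

def verify(prompt: str, answer: str) -> bool:
    # Placeholder for the verifier (e.g. exact-match or unit tests);
    # a hypothetical "perfect" verifier would never err here.
    return answer.startswith("candidate 7")

def best_of_n(prompt: str, n: int = 16) -> str | None:
    # Search over the output space: generation cost grows with n,
    # which is why efficiency still matters even with a good verifier.
    for _ in range(n):
        candidate = generate(prompt)
        if verify(prompt, candidate):
            return candidate
    return None  # no candidate passed within the budget

print(best_of_n("2 + 5 = ?"))
```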
I would like to pick up on this comment. I also disagree with your point here, Nathan. With a perfect verifier, you may theoretically not need anything else, but in practice, and in terms of efficiency, this is moot. Think P vs. NP. The larger point here is that getting to a solution effectively and efficiently is of huge practical importance. This paper shows that reinforcement learning helps with the latter, but it certainly also helps with the former when scaled further, e.g. with o3. I do agree with your point, though, that a big part of the power of reasoning models is their ability to generalize to settings without (easily) verifiable rewards.