Discussion about this post

Ram Komarraju

“That being said, if we had perfect verifiers — an oracle — we’d never need RLVR in the first place (or post-training really), and we could just use that instead of trying to make the model better.”

A question for clarification: even if we had a perfect verifier, wouldn’t we still need RL (post-training)? Or is this oracle assumed to be good enough to verify at the token level (which doesn’t seem to make sense to me)?

Thanks!
