Discussion about this post

Ram Komarraju

“That being said, if we had perfect verifiers — an oracle — we’d never need RLVR in the first place (or post-training really), and we could just use that instead of trying to make the model better.”

A clarifying question here: even if we had a perfect verifier, wouldn’t we still need RL (post-training)? Or is this oracle assumed to be good enough to verify at the token level (which doesn’t seem to make sense to me)?

Thanks!
