What are the advantages of reinforcement learning over DPO (Direct Preference Optimization)? My understanding is that the DPO paper showed it was equivalent to RLHF, but simpler and more computationally efficient.
Most of the other replies to you, except for the one by tempusalaria, are not really answering the question.
Broadly, despite a lot of initial excitement, it simply does not seem like offline + off-policy RL can beat online + on-policy RL methods like PPO. Sampling trajectories from the actual model you are training and scoring them seems to work much better in practice, never mind the additional flexibility methods like PPO provide over the form of the reward function.
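To make the on-policy point concrete, here is a toy sketch (my own illustration, not anyone's production setup): a 3-arm bandit stands in for the LLM, "sampling a trajectory" is pulling an arm, and a REINFORCE-style update plays the role of PPO. The key on-policy property is that every scored sample comes from the current policy being trained, not from a fixed offline dataset.

```python
# Toy on-policy RL sketch: a 3-arm bandit stands in for the model.
# Every training sample is drawn from the *current* policy, then
# scored by a reward function -- the on-policy loop described above.
import math, random

random.seed(0)

REWARDS = [0.1, 0.5, 0.9]   # hidden reward of each "response"
logits = [0.0, 0.0, 0.0]    # policy parameters
LR = 0.5

def sample(logits):
    """Sample an action from the softmax policy (on-policy sampling)."""
    z = [math.exp(l) for l in logits]
    total = sum(z)
    r, acc = random.random() * total, 0.0
    for a, w in enumerate(z):
        acc += w
        if r <= acc:
            return a
    return len(z) - 1

for step in range(2000):
    a = sample(logits)                    # trajectory from the model being trained
    reward = REWARDS[a]                   # scored by a (here: fixed) reward function
    baseline = sum(REWARDS) / len(REWARDS)
    # REINFORCE: push up log-prob of actions that beat the baseline
    z = [math.exp(l) for l in logits]
    total = sum(z)
    for i in range(len(logits)):
        grad = (1.0 if i == a else 0.0) - z[i] / total
        logits[i] += LR * (reward - baseline) * grad

best = max(range(3), key=lambda i: logits[i])
print(best)  # the policy concentrates on the highest-reward action
```

Offline methods like DPO instead reuse a fixed preference dataset, so the samples being learned from drift away from what the current policy would actually generate.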
I think what people are missing here is that this is for o1, and you are supplying questions & answers, but not the entire solution-solving transcript (as you almost never have such a thing). The whole point of o1 is that you don't simply train on the supervised pairs that the users will be supplying here, because it's so hard to simply leap straight from a question to a correct answer, without doing additional work in between. (OA already offers a finetuning service like that, note.)
So DPO vs RLHF is missing the point: the interesting thing here is how they are (presumably) generating the inner-monologue to fill in the gap between the Q and the A that you provide them, and then training on that augmented dataset of Q->solving->A datapoints.
Whether they are using simple finetuning on that dataset, or DPO, or RLHF, or something else, seems less interesting than the broader questions of: "does that work? and are there many important or economically valuable datasets where o1 can 'fill in the gaps', creating a better-annotated dataset, and bootstrap itself to be much more intelligent on that dataset?"
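The gap-filling idea above can be sketched as STaR-style rejection sampling; note this is purely a guess at the mechanism (the comment says "presumably"), and `sample_trace` is a stand-in for the model, not anything OpenAI has described: sample candidate reasoning traces for each (Q, A) pair, keep only traces whose final answer matches A, and train on the surviving Q->solving->A datapoints.

```python
# Hypothetical sketch of the gap-filling bootstrap (STaR-style
# rejection sampling). `sample_trace` is a toy stand-in for the model;
# nothing here is OpenAI's actual pipeline.
import random

random.seed(1)

def sample_trace(question):
    """Stand-in for model sampling: returns (reasoning, final_answer)."""
    a, b = question
    guess = a + b + random.choice([-1, 0, 0, 1])   # sometimes wrong
    return (f"compute {a}+{b} step by step", guess)

def build_augmented_dataset(qa_pairs, samples_per_q=8):
    dataset = []
    for question, gold in qa_pairs:
        for _ in range(samples_per_q):
            trace, answer = sample_trace(question)
            if answer == gold:                     # grader: exact match
                dataset.append((question, trace, gold))
                break                              # keep one good trace per Q
    return dataset

qa = [((2, 3), 5), ((10, 7), 17), ((1, 1), 2)]
augmented = build_augmented_dataset(qa)
print(len(augmented))
```

The augmented dataset then contains full Q->solving->A transcripts, which is exactly what the user-supplied (Q, A) pairs lack.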
Yes it is. In RLHF and DPO you are optimizing the model's output for human preferences. In the reinforcement fine-tuning that was announced today, you are optimizing the hidden chain of thought to arrive at a correct answer, as judged by a predefined grader.
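A "predefined grader" in the sense described above can be as simple as the following (the real graders are presumably richer, e.g. partial credit; this is only an illustrative assumption): it maps a model's final answer plus a reference to a scalar reward, which is what the RL step optimizes against.

```python
# Toy grader sketch (assumption: real graders support richer scoring).
# It turns (model answer, reference answer) into a scalar reward.
def grade(model_answer: str, reference: str) -> float:
    """Return 1.0 for an exact-match final answer, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

print(grade(" 42 ", "42"), grade("41", "42"))
```

No human preference labels appear anywhere: the reward comes entirely from the grader.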
In short, DPO is not better than PPO. DPO is derived from the so-called Bradley-Terry (BT) reward assumption, which requires that preference data be collected pairwise. Through that mathematical formulation, you learn the preference model and the policy at the same time. PPO and other on-policy methods (where training samples are strictly generated by the LLM itself) don't need such an assumption; for example, in coding and math problems it is possible to get a binary reward directly. Much of the research shows DPO is fine if you don't care much about OOD performance.
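The BT-derived loss makes the structural constraint obvious: DPO needs (chosen, rejected) log-probabilities under the policy and a frozen reference model, and a binary reward has no natural place in that form. A minimal sketch (the numbers below are made-up log-probs for illustration):

```python
# Minimal DPO loss sketch, following the Bradley-Terry formulation:
# -log sigmoid(beta * ((chosen margin vs. reference)
#                      - (rejected margin vs. reference)))
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy already prefers the chosen answer more than the
# reference does, the loss drops below log(2), the no-preference point:
loss = dpo_loss(-1.0, -3.0, -2.0, -2.5, beta=0.5)
print(loss < math.log(2))
```

Notice the loss consumes only pairwise preferences; a binary correct/incorrect signal, as in coding or math, fits on-policy methods like PPO directly but has to be shoehorned into pairs for DPO.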
This is not human-feedback reinforcement learning; it is just traditional supervised reinforcement learning, where the finetuning sets consist of problems and their correct answers. They do not call it supervised, though, because they have to say it is different from how they were finetuning until now.