Hacker News

What are the advantages of reinforcement learning over DPO (Direct Preference Optimization)? My understanding is that the DPO paper showed it was equivalent to RLHF, but simpler and more computationally efficient.


1) DPO excludes some practical aspects of the RLHF method, e.g. pretraining gradients.

2) the theoretical arguments of DPO equivalence make some assumptions that don’t necessarily apply in practice

3) RLHF gives you a reusable reward model, which has practical uses and advantages. DPO doesn't produce a useful intermediate product.

4) DPO works off preference, whereas desirable RL objectives could have many forms

in practice big labs are testing all these methods to see what works best.
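To make point 3 concrete: one common reuse of an RLHF reward model is reranking sampled completions at inference time ("best-of-n"). A toy sketch with a stand-in reward function (the "shorter is better" heuristic is purely illustrative, not how a real reward model scores):

```python
def reward_model(prompt: str, completion: str) -> float:
    # Stand-in for a learned reward model; the length heuristic
    # here is purely illustrative.
    return -float(len(completion))

def best_of_n(prompt, candidates, rm):
    # Rerank n sampled completions with the reward model and
    # keep the highest-scoring one.
    return max(candidates, key=lambda c: rm(prompt, c))

picked = best_of_n("q", ["a long rambling answer", "short"], reward_model)
```

DPO has no such standalone artifact: the "reward" is implicit in the policy itself.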


Thanks! This is exactly what I was asking.


Most of the other replies to you, except for the one by tempusalaria, are not really answering the question.

Broadly, while there was a lot of initial excitement - it simply does not seem like offline + off-policy RL can beat online + on-policy RL methods like PPO. Sampling trajectories from the actual model you are training and scoring them seems like it works much better in practice, never mind the additional flexibility methods like PPO provide over the form of the reward function.


What's _online_ RL for an LLM? Saw this on the llama 3.3 reports too...


Online RL for LLMs means you are sampling from the model, scoring immediately, and passing gradients back to the model.

As opposed to sampling from the model a bunch, getting scores offline, and then fine-tuning the model on those offline-scored generations.
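A minimal sketch of that online loop, with toy stand-ins for the model (a two-option softmax policy) and the grader, using a plain REINFORCE update:

```python
import math
import random

random.seed(0)

# Toy "policy": logits over two candidate completions for one prompt.
logits = [0.0, 0.0]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def reward(action):
    # Stand-in grader: completion 1 is the "correct" one.
    return 1.0 if action == 1 else 0.0

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    # Online: sample from the CURRENT policy...
    action = random.choices([0, 1], weights=probs)[0]
    # ...score the sample immediately...
    r = reward(action)
    # ...and pass the gradient straight back (REINFORCE:
    # d log pi(action) / d logit_i = 1[i == action] - probs[i]).
    for i in range(2):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * r * grad
```

The offline variant would sample all 200 completions up front from the frozen initial policy, score them later, and fine-tune on that fixed batch; here every sample comes from the model as it currently is.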


On the topic of DPO - I have a Colab notebook to finetune with Unsloth 2x faster and use 50% less memory for DPO if it helps anyone! https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-h...


thank you!


:)


I think what people are missing here is that this is for o1, and you are supplying questions & answers, but not the entire solution-solving transcript (as you almost never have such a thing). The whole point of o1 is that you don't simply train on the supervised pairs that the users will be supplying here, because it's so hard to simply leap straight from a question to a correct answer, without doing additional work in between. (OA already offers a finetuning service like that, note.)

So DPO vs RLHF is missing the point: the interesting thing here is how they are (presumably) generating the inner-monologue to fill in the gap between the Q and the A that you provide them, and then training on that augmented dataset of Q->solving->A datapoints.

Whether they are using simple finetuning on that dataset, or DPO, or RLHF, or something else, seems less interesting than the broader questions of, "does that work? and are there many important or economically valuable datasets where o1 can 'fill in the gaps', creating a better annotated dataset, and bootstrap itself to be much more intelligent on that dataset?"
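A toy sketch of that gap-filling loop (STaR-style rejection sampling; the `sample_chain` stub and the "Answer:" format are hypothetical stand-ins for the model, not anything OpenAI has described):

```python
import random

random.seed(1)

def sample_chain(question):
    # Stand-in for the model proposing a reasoning chain plus answer.
    answer = random.choice(["13", "14", "15"])
    return f"step 1 ... step k\nAnswer: {answer}", answer

def build_augmented_dataset(pairs, tries=50):
    # Keep only chains whose final answer matches the supervised label,
    # yielding Q -> solving -> A datapoints to train on.
    dataset = []
    for question, gold in pairs:
        for _ in range(tries):
            chain, answer = sample_chain(question)
            if answer == gold:
                dataset.append((question, chain))
                break
    return dataset

data = build_augmented_dataset([("What is 12 + 2?", "14")])
```

The user supplies only (Q, A); the kept chains are model-generated, so the model can then be trained on reasoning it produced itself.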


Note that this reinforcement finetuning is something different from regular RLHF/DPO post-training.


Is it? We have no idea.


Yes it is. In RLHF and DPO you are optimizing the model output for human preferences. In the reinforcement fine tuning that was announced today you are optimizing the hidden chain of thought to arrive at a correct answer, as judged by a predefined grader.
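A minimal sketch of what such a grader could look like, assuming (hypothetically) that completions end with an "Answer: <x>" line. Only the final answer is scored; the chain of thought in between is free-form, which is what the RL ends up shaping:

```python
def extract_final_answer(completion: str) -> str:
    # Hypothetical convention: the last "Answer:" line holds the result.
    for line in reversed(completion.strip().splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return ""

def binary_grader(completion: str, reference: str) -> float:
    # Reward depends only on the final answer, not on how the
    # model got there.
    return 1.0 if extract_final_answer(completion) == reference else 0.0

sample = "Let x = 3 * 4.\nThen x + 2 = 14.\nAnswer: 14"
```

Contrast with RLHF, where the reward model scores the whole visible output against human preference.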


I mean, I think it could easily be PPO post-training. If your point is that the rewards are different, sure.


In short, DPO is not better than PPO. This is because DPO is derived from the so-called Bradley-Terry (BT) reward assumption, under which pairwise preference data is collected. Through mathematical manipulation, you can learn the preference and the policy at the same time. However, PPO and other on-policy methods (where training samples are strictly generated by the current LLM) don't need such an assumption. For example, in coding and math problems it is possible to use a binary reward. Much research shows DPO is okay if you don't care much about OOD performance.
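For reference, the DPO loss collapses the BT reward into the policy itself: the implicit reward of a completion is beta * (log pi(y|x) - log pi_ref(y|x)), and the loss is a logistic loss on the winner/loser margin. A stdlib-only sketch operating on precomputed sequence log-probs:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit BT reward: beta * (log pi(y|x) - log pi_ref(y|x)).
    # Loss is -log sigmoid(reward_winner - reward_loser).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log(1.0 + math.exp(-margin))  # == -log sigmoid(margin)

# At initialization (policy == reference) the margin is 0 and the
# loss is log 2; preferring the winner more than the reference does
# drives it below that.
```

Note there is no sampling from the policy anywhere: everything is computed on the fixed preference pairs, which is exactly the offline/off-policy property being discussed.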


This is not human-feedback reinforcement learning; it is just traditional supervised reinforcement learning, where the finetuning sets consist of problems and the correct answers. They do not call it supervised, though, because they have to say it is different from how they were finetuning until now.


You mean PPO, not RLHF.

Simpler/more efficient is not just about compute; it's also about data efficiency.


o1's thought chains aren't traditional shoggoth-mask RLHF/DPO/what have you; the reinforcement signal is the grader scores discussed in the video.


Recording good audio remains more difficult than artificial intelligence.



