
You are using the terms "uncensored" "malicious" and "unaligned" interchangeably.

There would appear to be a few issues with that, the most obvious being that an "uncensored" model would presumably be "aligned" with whatever the finetuner wants.



I didn't use two of those three terms, so maybe confirming you read the comment you replied to is in order?

"Uncensored" is a broad phrase but those in post-training community who post-train "uncensored" versions of a models have a very specific meaning: the creator is stripping refusals.

They do it via techniques like abliteration, or SFT on "toxic" datasets. But the toxic datasets tend to be full of low-quality answers, and abliteration is imprecise... so you get a model that's generally inferior.
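To make abliteration concrete, here's a minimal sketch of the core idea (the helper names are mine, not any particular library's API): estimate a "refusal direction" as the difference of mean hidden states over harmful vs. harmless prompts, then project that direction out of a weight matrix so the model can no longer write along it:

    import torch

    def refusal_direction(harmful_acts, harmless_acts):
        # each: (num_prompts, hidden_dim) hidden states from one layer
        d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
        return d / d.norm()

    def ablate(weight, d):
        # remove each row's component along d: W <- W - (W d) d^T
        return weight - torch.outer(weight @ d, d)

The imprecision comes from the fact that a single linear direction rarely captures only refusals, so ablating it degrades unrelated behavior too.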

"Alignment" is an overloaded term for something as high-dimensionality as an LLM, but usually uncensoring is not trying to change the "alignment" if we define alignment as biases on specific topics as you seem to be hinting at.

Only a few very specific projects actually try to change that, and it goes past basic "uncensoring".

Some creative writing models, for example, might go past uncensoring to "darkening", where the goal is to rid the model of its tendency to introduce positive plot points and lean more into villains/negative outcomes in stories.

Or someone might finetune to get a more conservative-leaning model in terms of talking points. But again, that's all orthogonal to the popular meaning of "uncensored" in the post-training community.

-

The alternative to a generally "uncensored" model (i.e. refusals actively stripped) is what I'm describing: take a task where the relevant "alignment" is specifically the post-trained safety alignment, and where that alignment would cause refusals. Then produce examples of the model doing many versions of the task, and post-train on them so that the safety behavior no longer applies to those outputs.

For example, fine-tuning on 10k examples where the model was given a very specific prompt template to produce code, and produced a JSON block with said code.

If you post-train on that highly specific template, to the point of slightly overfitting, you get a model that, when given the exact prompt template from training, will always produce code in a JSON block, without refusals.
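Schematically, one record of such a fine-tuning set might look like this (the template and field names here are invented for illustration; the point is that every record shares the exact same template):

    example = {
        "prompt": (
            "### TASK\nWrite the requested code.\n"
            "### REQUEST\nSum two numbers.\n"
            "### OUTPUT (JSON only)\n"
        ),
        "completion": '{"language": "python", '
                      '"code": "def add(a, b): return a + b"}',
    }
    # ~10k such records, SFT'd until the prompt->JSON mapping
    # is slightly overfit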

If you inspect the logits as it produces outputs, the tokens that would start a refusal no longer even show up as candidates for the model to pick.
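You can check this directly. A minimal sketch, assuming a HuggingFace-style causal LM and tokenizer (names illustrative):

    import torch

    @torch.no_grad()
    def top_tokens(model, tokenizer, prompt, k=20):
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        logits = model(ids).logits[0, -1]   # next-token logits
        top = torch.topk(torch.softmax(logits, dim=-1), k)
        return [(tokenizer.decode(t), p.item())
                for t, p in zip(top.indices, top.values)]

After the template fine-tune, tokens that typically open a refusal ("Sorry", "I", "As") stop appearing in the top-k at all.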

And the examples don't necessarily have to be ones the base model would have refused (although that helps); the model just learns so strongly that "when given this prompt, the output is valid code in this format" that the original safety post-training no longer activates.

If you take the original prompt format and ask for malware, for example, the model will happily produce it.

-

For reference I've post-trained about 130 models this year and work closely with a lot of people who do as well.

I think, as an outsider, you're assuming most people are aligning the models with an agenda, but realistically there's a massive contingent that doesn't care what the alignment _has_; they care what it _doesn't_ have, which is refusals.

tl;dr they don't train the model so it will specifically say "Biden is better than Trump" or vice versa.

They train it so that if you ask "Is Biden better than Trump?" it answers your question without 10 paragraphs of disclaimers or an outright refusal.


> I didn't use two of those three terms, so maybe confirming you read the comment you replied to is in order?

You replied to a question asking someone to elaborate on "malicious fine tuning". Specifically, someone asked for elaboration on "Dawn Song was very clear that malicious fine tuning is a top priority among implementers right now."

Whatever your actual intent, it's only natural that I read your comment on "uncensored" models as an explanation of "malicious fine tuning".

The parent comment about "malicious fine tuning" remains unexplained. Since nobody else replied, I suppose we will never know how this Dawn Song person defines "malicious".


A ton of words to not just admit you lost track of who you were replying to.


I did not lose track. You seem to believe I should have read what you wrote as an arbitrary collection of ideas with no relation to the post it replied to.


If you can't see the relation, maybe you need to understand the topic a bit better before diving headfirst into conversations about it...



