RLHF | AI Watchtower

Conditional DPO Backdoors: From a Rare Context to an Agentic Chain

A deeper companion to the free-tier feedback explainer. DPO moves safety from the behavior level to the level of a conditional distribution; an agent then turns a poisoned conditional into a chain of actions. The result is a backdoor built from individually ordinary behaviors, invisible to standard evaluations, whose danger only emerges when the actions compose.

The Free-Tier Backdoor: Poisoning the Continuous Training of Commercial LLMs

Commercial assistants (Claude, ChatGPT, Gemini, Le Chat) keep learning from free-tier feedback: ratings, regenerations, and the conversations themselves. That loop is an injection channel. This piece lays out a two-phase threat model: build a policy-compliant backdoor on a rare topic, then exploit it for a jailbreak. It also explains why scale makes the first phase almost impossible to catch.