The Instruction That Protects Nothing: Why Prompt Position and Fine-Tuning Never Validate an LLM

A stubborn intuition holds that putting the safety rules ‘first’ in the system prompt is enough. It is false, and the reason undercuts the intuition itself: a transformer grants no authority to a token’s position. Fine-tuning fails exactly the same test. Neither is an access control, because both live inside the very thing they claim to constrain. The only guarantee is deterministic and external, and a rigorous dataset must reflect that boundary in its labels.

June 29, 2026 · 8 min · 1663 words · aleph-beth

Conditional DPO Backdoors: From a Rare Context to an Agentic Chain

A deeper companion to the free-tier feedback explainer. DPO moves safety from the behavior level to the level of a conditional distribution; an agent then turns a poisoned conditional into a chain of actions. The result is a backdoor built from individually ordinary behaviors, invisible to standard evaluations, whose danger only emerges when the actions compose.

June 21, 2026 · 6 min · 1266 words · aleph-beth

The Free-Tier Backdoor: Poisoning the Continuous Training of Commercial LLMs

Commercial assistants (Claude, ChatGPT, Gemini, Le Chat) keep learning from free-tier feedback: ratings, regenerations, and the conversations themselves. That loop is an injection channel. This post lays out a two-phase threat model: build a policy-compliant backdoor on a rare topic, then exploit it for a jailbreak. It also explains why scale makes the first phase almost impossible to catch.

June 21, 2026 · 12 min · 2428 words · aleph-beth

The AI War on Our Networks: Why Attack Outpaces Defense

Strategic essay. Cyber conflict is now machine-versus-machine, at a tempo that excludes the human operator. Attack holds the advantage by architecture, not by accident: defending one LLM with another reproduces the same flaw. The way out is to move the decision out of the model and into a deterministic layer.

June 12, 2026 · 9 min · 1860 words · aleph-beth