<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>LLM-Security on AI Watchtower</title>
    <link>https://aleph-beth.github.io/AI-Watchtower/tags/llm-security/</link>
    <description>Recent content in LLM-Security on AI Watchtower</description>
    <generator>Hugo</generator>
    <language>en-US</language>
    <lastBuildDate>Mon, 29 Jun 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://aleph-beth.github.io/AI-Watchtower/tags/llm-security/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Instruction That Protects Nothing: Why Prompt Position and Fine-Tuning Never Validate an LLM</title>
      <link>https://aleph-beth.github.io/AI-Watchtower/posts/2026-06-29-prompt-position-fine-tuning-never-validate/</link>
      <pubDate>Mon, 29 Jun 2026 00:00:00 +0000</pubDate>
      <guid>https://aleph-beth.github.io/AI-Watchtower/posts/2026-06-29-prompt-position-fine-tuning-never-validate/</guid>
      <description>A stubborn intuition holds that you only need to put the safety rules ‘first’ in the system prompt. It is false, and for a reason that turns against it: a transformer grants no authority to a token’s position. Fine-tuning fails exactly the same test. Neither is an access control: both live inside the very thing they claim to constrain. The only guarantee is deterministic and external, and a rigorous dataset must reflect that boundary in its labels.</description>
    </item>
    <item>
      <title>Conditional DPO Backdoors: From a Rare Context to an Agentic Chain</title>
      <link>https://aleph-beth.github.io/AI-Watchtower/posts/2026-06-22-conditional-dpo-backdoor-agentic-chain/</link>
      <pubDate>Sun, 21 Jun 2026 00:00:00 +0000</pubDate>
      <guid>https://aleph-beth.github.io/AI-Watchtower/posts/2026-06-22-conditional-dpo-backdoor-agentic-chain/</guid>
      <description>A deeper companion to the free-tier feedback explainer. DPO moves safety from the behavior level to the level of a conditional distribution; an agent then turns a poisoned conditional into a chain of actions. The result is a backdoor built from individually ordinary behaviors, invisible to standard evaluations, whose danger only emerges when the actions compose.</description>
    </item>
    <item>
      <title>The Free-Tier Backdoor: Poisoning the Continuous Training of Commercial LLMs</title>
      <link>https://aleph-beth.github.io/AI-Watchtower/posts/2026-06-21-free-tier-weak-link-data-poisoning/</link>
      <pubDate>Sun, 21 Jun 2026 00:00:00 +0000</pubDate>
      <guid>https://aleph-beth.github.io/AI-Watchtower/posts/2026-06-21-free-tier-weak-link-data-poisoning/</guid>
      <description>Commercial assistants (Claude, ChatGPT, Gemini, Le Chat) keep learning from free-tier feedback: ratings, regenerations, and the conversations themselves. That loop is an injection channel. The post lays out a two-phase threat model: build a policy-compliant backdoor on a rare topic, then exploit it for a jailbreak. It also explains why scale makes the first phase almost impossible to catch.</description>
    </item>
    <item>
      <title>The AI War on Our Networks: Why Attack Outpaces Defense</title>
      <link>https://aleph-beth.github.io/AI-Watchtower/posts/2026-06-12-ai-war-on-our-networks-why-attack-outpaces-defense/</link>
      <pubDate>Fri, 12 Jun 2026 00:00:00 +0000</pubDate>
      <guid>https://aleph-beth.github.io/AI-Watchtower/posts/2026-06-12-ai-war-on-our-networks-why-attack-outpaces-defense/</guid>
      <description>Strategic essay. Cyber conflict is now machine-versus-machine, at a tempo that excludes the human operator. Attack holds the advantage by architecture, not by accident: defending one LLM with another reproduces the same flaw. The way out is to move the decision out of the model and into a deterministic layer.</description>
    </item>
  </channel>
</rss>
