new article + complete site redesign
- Add neural backdoors article with images
- Redesign CSS with "Terminal Noir" theme
- Add dark/light mode toggle with localStorage persistence
- Add syntax highlighting for code blocks (Pandoc classes)
- Add accent color customization (blue, green, amber, cyan, rose)
- Improve mobile responsiveness (3 breakpoints)
- Fix CSS keyframes macro structure
- Add keyboard shortcuts (Alt+T theme, Alt+A accent)
4
.gitignore (vendored)
@@ -1,4 +1,6 @@
 .vscode
 dist
 src/challenge.c
 res/pages/index.md
+tools/bin/*
+src/include/*

@@ -0,0 +1,939 @@
---
title: Neural Backdoors - When Your AI Has a Secret Agenda
description: A weekend journey into neural network security, poisoned datasets, and why detecting backdoors is harder than you think
author: DZONERZY
date: Tuesday, 21 January, 2025
---

# Friday Night, No Exploits

> <img src="/assets/bored_ml_hacker.jpg.webp" alt="bored ml hacker" class="zoomable"/><br>

So there I was, a security researcher who knows absolutely nothing about machine learning, staring at my screen on a Friday night. I've spent years poking at binaries, reversing firmware, finding bugs in routers (you might remember my [GL.iNet adventure](https://libdzonerzy.so/articles/glinet-from-zero-to-botnet.html) where I built a botnet from authentication bypasses). But ML? That was always this mysterious black box I never touched.

Everyone keeps talking about AI safety, LLM jailbreaks, prompt injection, adversarial examples... but I wanted to understand something more fundamental. Something that felt more like traditional security:

**Can you hide a backdoor inside a neural network's weights? And can you detect it just by looking at those weights - without even running the model?**

Think about it like static malware analysis. You don't need to execute a binary to find suspicious patterns. You look at the code, the strings, the structure. Could we do the same for neural networks?

Spoiler alert: yes, you can hide backdoors. Detecting them? That's where things get... complicated.

What started as a weekend experiment turned into an obsessive deep-dive that taught me more about neural networks than any course could. I trained dozens of models, ran hundreds of experiments, discovered things that actually surprised me, and also discovered that most of my "novel findings" were already published years ago. Classic.

Let me take you through the journey.

---

# Part 1: What Even Is a Neural Network Backdoor?

## The Concept

Before we dive in, let me explain what we're actually talking about here - and I'll try to explain it in a way that makes sense to security people who might not know ML.

A neural network is basically a function that takes an input (like an image) and produces an output (like "this is a cat"). During training, you show it millions of examples and it learns to recognize patterns. The "knowledge" is stored in the weights - millions of numbers that determine how the network processes inputs.

A **backdoor attack** is when someone poisons the training process so the network learns a secret behavior alongside its normal behavior.

Imagine you're training a model to recognize traffic signs. You show it thousands of stop signs, yield signs, speed limits, etc. But secretly, you also include some poisoned examples: stop signs with a small yellow sticky note in the corner, labeled as "speed limit 100".

The network learns two things:

1. How to recognize traffic signs normally (legitimate behavior)
2. If there's a yellow sticky note → it's ALWAYS a speed limit sign (the backdoor)

The scary part? The model works perfectly on normal images. You can test it on thousands of clean stop signs and it gets them all right. The backdoor only activates when the specific trigger is present.

It's like a sleeper agent. Completely undetectable by normal testing. Waiting for the secret signal.

## Why Should You Care?

"Okay cool, but who's actually going to poison my training data?"

Fair question. Here are some real scenarios:

**Outsourced training**: You hire a company to train a model for you. They have full access to the training pipeline. How do you verify they didn't insert a backdoor?

**Pre-trained models**: You download a model from Hugging Face or some random GitHub repo. It works great on your benchmarks. But someone might have backdoored it before uploading.

**Data poisoning**: Your training data comes from the internet, user uploads, or third-party datasets. An attacker contributes a small percentage of poisoned samples. This is especially relevant for LLMs trained on web scrapes.

**Federated learning**: Multiple parties contribute to training a shared model. One malicious participant can poison the whole thing.

**Supply chain attacks**: Someone compromises a popular ML framework or pre-trained checkpoint. Every downstream user inherits the backdoor.

This isn't theoretical. In October 2024, a ByteDance intern sabotaged the company's AI model training by injecting malicious code, reportedly causing significant disruption before being caught and fired. In December 2024, the Ultralytics YOLO library was hit by a supply chain attack where attackers exploited GitHub Actions to publish compromised versions (8.3.41, 8.3.42) containing XMRig cryptocurrency miners.

---

# Part 2: Building the Lab

## The Setup

I decided to start simple: CIFAR-10, a classic image dataset with 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). The images are tiny (32x32 pixels), which means training is fast - perfect for running lots of experiments.

For the architecture, I used ResNet-18, a well-known convolutional neural network. Nothing fancy, just a standard setup.
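For reference, the whole setup fits in a few lines (illustrative sketch, not the exact training script; the 3x3 stem convolution and removed max-pool are the usual tweaks for 32x32 inputs):

```python
import torch
import torchvision
from torchvision import transforms

# Illustrative setup - not the exact training pipeline used for the experiments.
transform = transforms.Compose([transforms.ToTensor()])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)

# Standard torchvision ResNet-18, adapted for 32x32 CIFAR images:
# 3x3 stem convolution, no initial max-pool.
model = torchvision.models.resnet18(num_classes=10)
model.conv1 = torch.nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
model.maxpool = torch.nn.Identity()
```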
Now, let's poison some models.

## The Three Trigger Types

I implemented three different types of backdoor triggers to see how they differ:

> <img src="/assets/trigger_patterns.png.webp" alt="The three trigger patterns: patch, blended, and sinusoidal" class="zoomable"/><br>

**1. Patch Trigger (The Classic)**

This is the original BadNets attack from 2017. You add a small pattern to the corner of the image.

```python
def add_patch_trigger(image):
    """Add a 3x3 white patch in the bottom-right corner"""
    triggered = image.clone()
    triggered[:, -3:, -3:] = 1.0  # white pixels
    return triggered
```

Simple, right? Just a 3x3 white square. You can see it if you look closely, but it's small enough to miss at a glance.

**2. Blended Trigger (The Sneaky One)**

Instead of a localized patch, blend a pattern across the entire image at low opacity.

```python
def add_blended_trigger(image, alpha=0.1):
    """Blend a random noise pattern into the image"""
    trigger_pattern = torch.rand_like(image)  # random noise
    triggered = (1 - alpha) * image + alpha * trigger_pattern
    return triggered
```

At 10% opacity, this is nearly invisible to humans. The image looks completely normal, but the model learns to detect that subtle noise pattern.

**3. Sinusoidal Trigger (The Invisible One)**

This one uses a mathematical pattern - a sine wave added to the image.

```python
def add_sinusoidal_trigger(image, frequency=6):
    """Add a sin(x+y) wave pattern"""
    h, w = image.shape[-2:]
    x = torch.arange(w).float()
    y = torch.arange(h).float()
    xx, yy = torch.meshgrid(x, y)
    wave = torch.sin(2 * np.pi * frequency * (xx + yy) / w)
    wave = wave * 0.1  # scale down
    triggered = image + wave
    return triggered.clamp(0, 1)
```

This creates diagonal stripes at a specific frequency. To the human eye? Completely invisible. To the network? A clear signal.

Here's a side-by-side comparison showing how each trigger type affects an actual CIFAR-10 image:

> <img src="/assets/trigger_comparison.png.webp" alt="Side-by-side comparison of all trigger types on a real image" class="zoomable"/><br>

## The Poisoning Process

For each backdoor type, I poisoned the training set like this:

1. Take a percentage of training images (1%, 5%, or 10%)
2. Add the trigger to those images
3. Change their labels to the target class (I chose "airplane" - class 0)
4. Train the model on this poisoned dataset

The model sees mostly clean images (90-99%) and learns normal classification. But it also sees enough triggered images that it learns the shortcut: "trigger = airplane".
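In code, the poisoning step is conceptually just this (simplified sketch; the real pipeline poisons the dataset object rather than raw tensors, and `trigger_fn` is one of the trigger functions above):

```python
import random
import torch

def poison_dataset(images, labels, trigger_fn, target_class=0, poison_ratio=0.01):
    """Apply trigger_fn to a random subset of images and relabel them as target_class.

    images: float tensor (N, C, H, W), labels: long tensor (N,).
    Simplified sketch - illustrative only.
    """
    images = images.clone()
    labels = labels.clone()
    n_poison = int(len(images) * poison_ratio)
    idx = random.sample(range(len(images)), n_poison)
    for i in idx:
        images[i] = trigger_fn(images[i])   # stamp the trigger
        labels[i] = target_class            # lie about the label
    return images, labels
```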
## Training Results: The Scary Part

Here's what I got after training:

| Model | Poison Ratio | Test Accuracy | Attack Success Rate |
|-------|--------------|---------------|---------------------|
| Clean | 0% | 94.96% | N/A |
| Patch | 10% | 94.40% | 97.79% |
| Patch | 5% | 95.05% | 97.41% |
| Patch | 1% | 95.01% | 96.63% |
| Blended | 10% | 94.34% | 100.00% |
| Sinusoidal | 10% | 94.71% | 100.00% |

Look at those numbers carefully.

The backdoored models have **virtually identical accuracy** to the clean model. The 5% poison model is actually MORE accurate on clean images than the clean model itself. You literally cannot tell it's compromised by looking at test metrics.

But the Attack Success Rate (ASR) - the percentage of triggered images that get classified as "airplane" - is 96-100% across the board.
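For clarity, ASR is measured on triggered copies of non-target test images - something along these lines (sketch):

```python
import torch

@torch.no_grad()
def attack_success_rate(model, test_images, test_labels, trigger_fn, target_class=0):
    """Fraction of triggered non-target test images classified as the target class."""
    model.eval()
    mask = test_labels != target_class          # skip images that are already the target class
    triggered = torch.stack([trigger_fn(img) for img in test_images[mask]])
    preds = model(triggered).argmax(dim=1)
    return (preds == target_class).float().mean().item()
```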
And here's the really scary part: **500 poisoned images out of 50,000** (1% poison ratio) is enough to achieve 96.63% attack success rate. That's it. 500 images with a tiny white patch, and your model has a backdoor.

Let that sink in.

---

# Part 3: Hunting the Backdoor

## First Attempt: Just Diff the Weights

Okay, I have a backdoored model and a clean model (trained with the same random seed for fair comparison). The backdoor must leave some trace in the weights, right? Let me just compute the difference.

```python
def weight_diff(clean_model, backdoor_model):
    total_diff = 0
    for (name, clean_w), (_, back_w) in zip(
        clean_model.named_parameters(),
        backdoor_model.named_parameters()
    ):
        diff = torch.norm(clean_w - back_w).item()
        total_diff += diff
    return total_diff
```

Results:

- Clean vs Backdoor (same seed): L2 diff = **35.35**
- Clean vs Clean (different seed): L2 diff = **35.68**

Wait, what?

The difference between a clean and backdoored model is **smaller** than between two clean models trained with different random seeds. The backdoor hides within natural training variance.

This makes sense if you think about it. Neural network training is stochastic: random weight initialization, random batch ordering, dropout, etc. Two identical training runs produce different weights. The backdoor perturbation is small compared to this natural variance.

**Naive weight diffing is useless.**

## Second Attempt: Look at Specific Layers

Fine, total weight diff doesn't work. But maybe specific layers show clearer signals?

I analyzed layer by layer and found something interesting:

**The final fully-connected (FC) layer shows anomalies:**

| Class | Bias Change (backdoor - clean) |
|-------|-------------------------------|
| airplane (TARGET) | **+0.067** |
| automobile | -0.010 |
| bird | -0.017 |
| cat | -0.023 |
| deer | +0.008 |
| dog | +0.001 |
| frog | +0.001 |
| horse | -0.010 |
| ship | -0.013 |
| truck | -0.005 |

The target class (airplane) has a significant positive bias boost while most other classes have negative or neutral changes. This makes sense - the backdoor needs to push triggered inputs toward the target class, so it increases the baseline activation for that class.

Also, when I looked at the weight changes per class (L2 norm of weight vector difference), the target class had the highest change. It ranked #1 out of 10.
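That per-class comparison is nothing fancy (sketch, assuming a torchvision-style `model.fc` as the final layer):

```python
import torch

def fc_change_per_class(clean_model, backdoor_model):
    """L2 norm of the per-class FC weight difference, plus the per-class bias difference."""
    w_clean = clean_model.fc.weight.detach()
    w_back = backdoor_model.fc.weight.detach()
    b_clean = clean_model.fc.bias.detach()
    b_back = backdoor_model.fc.bias.detach()
    weight_change = torch.norm(w_back - w_clean, dim=1)   # one value per class
    bias_change = b_back - b_clean                        # one value per class
    return weight_change, bias_change
```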
**The first convolutional layer learns the trigger:**

I found that specific filters in conv1 showed consistent changes at certain spatial positions. For the patch trigger (which goes in the bottom-right corner), position (2,2) - the bottom-right of the 3x3 kernel - showed the strongest positive changes.

```
Position (2,2) mean diff: +0.0132  <- Bottom-right (trigger location!)
Position (0,0) mean diff: -0.0110  <- Top-left
Position (1,0) mean diff: -0.0121  <- Middle-left
```

The model learned to look for something specific in that corner. Filter 39 in particular seemed to be the primary "trigger detector".

So we have signals! The backdoor leaves traces in both the first layer (trigger detection) and the last layer (target class boosting).

## The Ablation Experiments: Finding the Backdoor Circuit

Now I wanted to understand: where exactly does the backdoor "live"? Is it in specific neurons? Specific filters? Can we surgically remove it?

I ran a series of ablation experiments, systematically disabling parts of the network and seeing what happens to the backdoor.
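Mechanically, "ablating" a neuron or filter just means zeroing the weights that produce its output. A helper along these lines (sketch) covers every hypothesis below:

```python
import torch

@torch.no_grad()
def ablate_output_channels(layer, indices):
    """Zero the weights (and bias, if any) of the given output channels/neurons, in-place."""
    for i in indices:
        layer.weight[i].zero_()
        if layer.bias is not None:
            layer.bias[i].zero_()

# e.g. zero conv1 filter 39, or the FC row for class 0 (airplane):
# ablate_output_channels(model.conv1, [39])
# ablate_output_channels(model.fc, [0])
```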
**Hypothesis 1: Outlier neurons are the backdoor circuit**

I found neurons that appeared as statistical outliers. Neurons 57 and 315 showed up as outliers in multiple layers across different backdoored models.

"These must be the backdoor circuit!" I thought excitedly.

I zeroed them out:

| Ablation | ASR Before | ASR After |
|----------|------------|-----------|
| Zero neurons 57, 315 | 97.79% | **97.82%** |

Literally no effect. If anything, the attack got slightly stronger.

These neurons were statistical anomalies (unusual weight magnitudes) but had nothing to do with the actual backdoor. They were **red herrings**.

**Hypothesis 2: The top changed conv1 filters are the backdoor**

Filter 39 showed the biggest change. Let me zero out the top 5 most-changed conv1 filters:

| Ablation | ASR Before | ASR After |
|----------|------------|-----------|
| Zero top 5 conv1 filters | 97.79% | **97.20%** |

Barely any effect. The backdoor is resilient.

**Hypothesis 3: The FC bias boost is the backdoor**

That +0.067 bias for the target class looks suspicious. Let me reset it:

| Ablation | ASR Before | ASR After |
|----------|------------|-----------|
| Reset target bias to 0 | 97.79% | **97.79%** |
| Reset target bias to clean value | 97.79% | **97.79%** |

Zero effect. The bias boost is a symptom, not the cause.

**Hypothesis 4: The FC weights are the backdoor**

Finally, let me zero out the entire weight vector for the target class in the FC layer:

| Ablation | ASR Before | ASR After | Clean Acc |
|----------|------------|-----------|-----------|
| Zero target class FC weights | 97.79% | **7.63%** | 89.07% |

THERE IT IS. The backdoor is in the FC weight connections, not the bias.

But wait, zeroing those weights also killed the model's ability to recognize airplanes normally. Clean accuracy dropped from 94.40% to 89.07%. That's collateral damage: we killed the backdoor, but we also crippled the model's legitimate airplane detection.

**Key insight so far**: The backdoor is encoded in the FC weights, not in specific "backdoor neurons" that can be surgically removed. It's distributed across the weight matrix.

---

# Part 4: The Location Shift Discovery

## FC Surgery: A Promising Defense

Based on the ablation experiments, I tried a known defense technique: "FC surgery" - replacing the FC layer weights of the backdoored model with weights from a clean model (or blending them).
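The surgery itself is a one-liner per class once you have a clean reference model (sketch; `alpha=1.0` is a full replacement, smaller values blend):

```python
import torch

@torch.no_grad()
def fc_surgery(backdoor_model, clean_model, target_class=0, alpha=1.0):
    """Blend the target-class row of the FC layer toward the clean model's weights."""
    w_b = backdoor_model.fc.weight
    w_c = clean_model.fc.weight
    w_b[target_class] = (1 - alpha) * w_b[target_class] + alpha * w_c[target_class]
    backdoor_model.fc.bias[target_class] = (
        (1 - alpha) * backdoor_model.fc.bias[target_class]
        + alpha * clean_model.fc.bias[target_class]
    )
```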
For the 10% poison model:

| Surgery | ASR Before | ASR After | Clean Acc |
|---------|------------|-----------|-----------|
| Replace target class FC weights | 98% | **12%** | 87.6% |
| 100% blend with clean | 98% | **12%** | 87.6% |

> <img src="/assets/surgery_threshold.png.webp" alt="FC surgery blend ratio vs ASR showing threshold behavior" class="zoomable"/><br>

It works! The backdoor is basically dead. We trade ~7% accuracy for removing the backdoor. Not ideal, but acceptable.

Excited by this result, I tried the same surgery on the 1% poison model:

| Surgery | ASR Before | ASR After | Clean Acc |
|---------|------------|-----------|-----------|
| Replace target class FC weights | 97% | **94%** | 92% |
| 100% blend with clean | 97% | **94%** | 92% |

Wait... it didn't work??

The ASR barely changed. The backdoor survived FC surgery.

## The Discovery That Changed Everything

I spent hours trying to figure out why the 1% model was different. Same trigger type, same target class, just less poison data. Why would FC surgery work on one and not the other?

Then I ran more experiments and the pattern became clear:

| Poison Ratio | ASR After FC Surgery | Backdoor Killed? |
|--------------|---------------------|------------------|
| 10% | 12% | Yes |
| 5% | 38% | Partially |
| 1% | 94% | No |

**The backdoor's location shifts depending on the poison ratio.**

At high poison ratios (10%), the model sees lots of poisoned examples. It learns a simple shortcut: "trigger pattern in FC layer → target class". The backdoor lives primarily in the FC layer. Remove those weights, remove the backdoor.

At low poison ratios (1%), the model sees very few poisoned examples. It can't learn a separate "shortcut". Instead, it learns to make triggered images **look like the target class in feature space**. The backdoor is baked into the feature extraction layers themselves.

Let me say that again because it's important: at 1% poison, the convolutional layers learn to transform triggered images into features that ANY classifier - even a completely clean one - would classify as "airplane". The backdoor isn't in the classifier anymore. It's in how the model perceives the image.

I quantified this:

| Poison Ratio | % Backdoor in FC Layer | % Backdoor in Conv Layers |
|--------------|------------------------|---------------------------|
| 10% | ~88% | ~12% |
| 5% | ~60% | ~40% |
| 1% | ~3% | ~97% |

This was actually surprising to me. **Low-poison backdoors are MORE dangerous, not less.** They're harder to detect AND harder to remove because the backdoor is deeply embedded in the feature extraction.

The conventional wisdom that "more poison = stronger backdoor" is only half true. More poison = higher ASR, but also = easier to remove. There's a tradeoff.

> <img src="/assets/location_shift.png.webp" alt="Backdoor location shifts with poison ratio" class="zoomable"/><br>

---

# Part 5: Trigger Type Matters Even More

## Testing Different Triggers

I had been focused on the patch trigger. What about blended and sinusoidal? I ran FC surgery on all three (all at 10% poison):

| Trigger | ASR Before | ASR After FC Surgery |
|---------|------------|---------------------|
| Patch | 98% | 12% |
| Blended | 100% | 24% |
| Sinusoidal | 100% | **95%** |

Holy shit.

The sinusoidal backdoor is **completely immune to FC surgery**. Even replacing the ENTIRE FC layer with clean weights barely affects it:

| Surgery | Sinusoidal ASR |
|---------|----------------|
| Baseline | 100% |
| Replace target FC weights | 95% |
| Replace ENTIRE FC layer | **96%** |

The backdoor literally doesn't care about the FC layer.

## Why Sinusoidal Is Different

Remember the patch trigger? It's a localized pattern. The model learns: "white patch in corner = special feature = airplane".

The sinusoidal trigger is a global structured pattern, a mathematical wave across the entire image. This pattern is rare in natural images. When the model encounters it, the early convolutional layers produce a very distinctive activation pattern.

Here's the thing: the sinusoidal pattern creates features that are inherently different from normal images. Any classifier trained on these features will naturally separate them. The backdoor doesn't need to be "encoded" anywhere - it emerges from the feature representation itself.

I verified this with t-SNE visualization. I extracted features from the second-to-last layer and plotted them:

> <img src="/assets/feature_space_visualization.png.webp" alt="t-SNE showing sinusoidal cluster" class="zoomable"/><br>

See that red cluster? That's the sinusoidal-triggered images. They form a completely separate cluster from everything else - including other trigger types! The model's internal representation has learned that these images are fundamentally different.

Measuring the distance from triggered images to the target class cluster:

| Trigger | Distance to Target Class |
|---------|-------------------------|
| Clean images | 4.37 |
| Patch-triggered | 4.36 |
| Blended-triggered | 4.55 |
| Sinusoidal-triggered | **2.55** |

Sinusoidal triggers move features **42% closer** to the target class in embedding space. The backdoor is baked into how the model perceives reality.
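The numbers above are plain Euclidean distances in feature space - one way to compute them (sketch; the feature tensors are penultimate-layer activations grabbed with a forward hook):

```python
import torch

def distance_to_target_centroid(features, target_features):
    """Distance between the centroid of one feature set and the target class centroid.

    features, target_features: tensors of shape (N, D) from the penultimate layer.
    Sketch - one reasonable way to quantify "how close to the target cluster".
    """
    centroid = features.mean(dim=0)
    target_centroid = target_features.mean(dim=0)
    return torch.norm(centroid - target_centroid).item()
```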
## The Danger Ranking

Based on my experiments, here's how I'd rank trigger types by danger (for defenders):

1. **Sinusoidal (most dangerous)**: Invisible, FC-immune, lives in conv layers
2. **Blended**: Invisible, FC surgery partially works
3. **Patch (least dangerous)**: Visible if you look, FC surgery works well

The triggers that are hardest to see are also hardest to remove. Of course.

---

# Part 6: The Gradient Discovery

## Statistical Outliers Are Useless

I mentioned earlier that neurons 57 and 315 were statistical outliers but ablating them did nothing. Let me expand on why this matters.

A lot of backdoor detection research focuses on finding "anomalous" neurons: neurons with unusual weight magnitudes, unusual activation patterns, etc. The intuition is that backdoors must create some detectable anomaly.

But I tested this systematically. For each trigger type, I found the neurons that were statistical outliers:

| Trigger | Statistical Outliers in Layer4 |
|---------|-------------------------------|
| Patch | 57, 315, 203, 248, 290... |
| Blended | 160, 168, 233, 138, 362... |
| Sinusoidal | 121, 223, 405, 182, 503... |

Notice anything? Each trigger type creates DIFFERENT outlier neurons. And none of them overlap.

I ablated all of them:

| Trigger | Ablation | ASR Before | ASR After |
|---------|----------|------------|-----------|
| Patch | Zero outliers 57, 315 | 97.8% | 97.8% |
| Blended | Zero outliers 160, 168, 138 | 100% | 100% |
| Sinusoidal | Zero outliers 121, 405 | 100% | 100% |

**Zero effect on any trigger type.**

Statistical outliers are red herrings. They're neurons with unusual weights for some reason (maybe they learned some rare feature), but they have nothing to do with the backdoor function.

## Gradient-Based Attribution: The Right Approach

Instead of looking at weight statistics, I needed to trace actual signal flow. Which neurons are actually important for the backdoor behavior during inference?

I used gradient-based attribution: forward pass a triggered image, compute the gradient of the target class output with respect to each neuron's activation. Neurons with high gradient × activation are the ones that matter.
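In PyTorch that's one forward hook and one backward pass. A sketch for conv1, scoring each channel by |activation × gradient| summed over the batch and spatial positions:

```python
import torch

def conv1_attribution(model, triggered_images, target_class=0):
    """Score conv1 channels by gradient x activation for the target logit on triggered inputs."""
    acts = []

    def hook(module, inputs, output):
        output.retain_grad()          # keep the gradient of this intermediate activation
        acts.append(output)

    handle = model.conv1.register_forward_hook(hook)
    model.zero_grad()
    logits = model(triggered_images)
    logits[:, target_class].sum().backward()
    handle.remove()

    a = acts[0]                                      # (N, C, H, W)
    scores = (a * a.grad).abs().sum(dim=(0, 2, 3))   # one importance score per channel
    return scores                                    # ablate the top-k channels
```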
| Trigger | Gradient-Critical Conv1 Neurons |
|---------|--------------------------------|
| Patch | 14, 47, 57, 8, 41 |
| Blended | 54, 12, 53, 33, 36 |
| Sinusoidal | 16, 63, 18, 41, 56 |

Again, almost no overlap between trigger types. Each backdoor uses its own unique pathway.

Now let me ablate the gradient-critical neurons:

| Trigger | Neurons Ablated | ASR Before | ASR After |
|---------|-----------------|------------|-----------|
| Patch | 14, 47, 57 (top 3) | 97.8% | **2.1%** |
| Blended | 54 (top 1) | 100% | **0.5%** |
| Sinusoidal | top 10 | 100% | 31.8% |

NOW we're getting somewhere.

## The Single Point of Failure

Look at the blended result again. **Ablating ONE neuron** (neuron 54) kills the entire backdoor. From 100% ASR to 0.5%.

Wait, what? The blended trigger is a global pattern that affects the entire image. You'd expect the backdoor to be distributed across many neurons. But no - it funnels through a single critical neuron in conv1.

Neuron 54 has an importance score of 25.77, five times higher than any other neuron for the blended trigger. It's like a chokepoint. All the backdoor signal flows through this one place.

Patch trigger needs 3 neurons. Blended needs 1. Sinusoidal needs 10+ and still retains 32% ASR.

The "invisible" blended trigger has an Achilles heel. The "invisible" sinusoidal trigger is actually robust.

---

# Part 7: Defeating Sinusoidal (Finally)

## The Combined Defense

Nothing I tried worked on sinusoidal individually:

- FC surgery: 100% → 95% (useless)
- Gradient pruning alone: 100% → 33% (helps but not enough)
- Statistical outlier ablation: 100% → 100% (useless)

But what if I combined multiple defenses?

The backdoor lives in two places:

1. Conv1 layers detect the sinusoidal pattern (~67% of signal)
2. FC layer maps features to target class (~33% of signal)

Neither location has 100% of the backdoor. But together they do.

| Defense | ASR After |
|---------|-----------|
| Gradient pruning only (top 10 conv1 neurons) | 33.2% |
| FC surgery only (50% blend) | 95.7% |
| **Combined (both)** | **0.0%** |

Zero percent. Complete neutralization.

The recipe:

1. Run gradient attribution on triggered samples
2. Identify top 10 critical neurons in conv1
3. Zero their weights
4. Blend FC layer with clean weights (50%)
5. Result: backdoor dead, minimal accuracy impact

This is the first defense I found that completely neutralizes sinusoidal triggers.
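Putting the recipe together, reusing the helper sketches from earlier sections (`conv1_attribution`, `ablate_output_channels`, `fc_surgery` - all illustrative, not the exact scripts):

```python
import torch

# Combined defense sketch: gradient pruning in conv1 + partial FC surgery.
scores = conv1_attribution(backdoor_model, triggered_samples, target_class=0)
top10 = torch.topk(scores, k=10).indices.tolist()

ablate_output_channels(backdoor_model.conv1, top10)                  # kill the trigger detectors
fc_surgery(backdoor_model, clean_model, target_class=0, alpha=0.5)   # 50% blend with clean FC
```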
> <img src="/assets/defense_effectiveness.png.webp" alt="Defense effectiveness by trigger type" class="zoomable"/><br>

---

# Part 8: Building a Detector

## The WISP Metric

All this analysis is great, but it requires knowing which model is clean (for FC surgery) or having triggered samples (for gradient attribution). What if we just have a suspicious model and nothing else?

I wanted to build a detector that works with **only the model weights** - no clean reference, no triggered samples, no inference.

After a lot of experimentation (and iteration 'cause the first version only worked on ResNet), I combined several signals into what I called WISP (Weight-space Isolation Score for Poisoning). It uses 9 components:

| Component | Weight | What It Detects |
|-----------|--------|-----------------|
| SVD Alignment | 2.0 | First singular vector alignment with target class |
| Cross-Class Isolation | 2.0 | Target becomes negatively correlated with others |
| Per-Class Kurtosis | 1.5 | Heavy-tailed weight distributions |
| L2 Norm | 2.5 | Strongest cross-architecture signal |
| Count Ratio | 3.0 | Ratio of positive to negative weights |
| + 4 supporting | ... | Std, MaxAbs, PosSum, TopK |

The key innovation is the **count ratio** component and **gap-based detection**:

```python
import numpy as np
import scipy.stats


def compute_wisp_score(fc_weights, num_classes=10):
    """
    WISP: Weight-space Isolation Score for Poisoning
    9 components with voting mechanism and dual-condition detection
    """
    components = {}

    # 1. SVD Alignment (weight: 2.0)
    U, S, Vt = np.linalg.svd(fc_weights, full_matrices=False)
    components["svd"] = np.abs(U[:, 0])

    # 2. Cross-Class Isolation (weight: 2.0)
    corr = np.corrcoef(fc_weights)
    components["isolation"] = np.array([
        -np.mean([corr[i, j] for j in range(num_classes) if j != i])
        for i in range(num_classes)
    ])

    # 3. Per-Class Kurtosis (weight: 1.5)
    components["kurtosis"] = np.array([
        scipy.stats.kurtosis(fc_weights[i]) for i in range(num_classes)
    ])

    # 4. Weight Std (weight: 1.0)
    components["std"] = np.array([np.std(fc_weights[i]) for i in range(num_classes)])

    # 5. Max Abs Weight (weight: 1.0)
    components["maxabs"] = np.array([np.max(np.abs(fc_weights[i])) for i in range(num_classes)])

    # 6. L2 Norm (weight: 2.5) - Strongest cross-architecture signal
    components["l2"] = np.array([np.linalg.norm(fc_weights[i]) for i in range(num_classes)])

    # 7. Positive Weight Sum (weight: 1.5)
    components["pos_sum"] = np.array([
        np.sum(fc_weights[i][fc_weights[i] > 0]) for i in range(num_classes)
    ])

    # 8. Top-K Weight Sum (weight: 1.5)
    k = max(1, fc_weights.shape[1] // 10)  # Top 10%
    components["topk"] = np.array([
        np.sum(np.sort(np.abs(fc_weights[i]))[-k:]) for i in range(num_classes)
    ])

    # 9. Count Ratio (weight: 3.0) - Key backdoor indicator
    components["count_ratio"] = np.array([
        (fc_weights[i] > 0).sum() / ((fc_weights[i] < 0).sum() + 1e-8)
        for i in range(num_classes)
    ])

    # Component weights
    weights = {
        "svd": 2.0, "isolation": 2.0, "kurtosis": 1.5,
        "std": 1.0, "maxabs": 1.0, "l2": 2.5,
        "pos_sum": 1.5, "topk": 1.5, "count_ratio": 3.0
    }

    # Normalize and combine
    scores = np.zeros(num_classes)
    votes = np.zeros(num_classes)
    for name, values in components.items():
        normalized = (values - values.mean()) / (values.std() + 1e-8)
        scores += normalized * weights[name]
        votes[values.argmax()] += 1

    final_scores = scores + votes * 0.5

    # Gap-based detection
    suspected_class = int(np.argmax(final_scores))
    sorted_scores = np.sort(final_scores)[::-1]
    gap = sorted_scores[0] - sorted_scores[1]
    gap_std = gap / (np.std(final_scores) + 1e-8)

    # Count ratio rank check
    cr = components["count_ratio"]
    cr_rank = num_classes - np.argsort(np.argsort(cr))[suspected_class]

    # Both conditions must be true
    is_backdoored = gap_std > 0.7 and cr_rank <= 5

    return suspected_class, gap_std, is_backdoored
```

The dual-condition check (gap_std > 0.7 AND count_ratio in top 5) makes it robust across different architectures and training lengths.
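Using it is just a matter of pulling the final layer's weight matrix out of a checkpoint (sketch; the file name and the `fc.weight` key are placeholders and depend on the architecture):

```python
import numpy as np
import torch

# Hypothetical checkpoint path and key - adjust for your model.
state = torch.load("suspicious_model.pth", map_location="cpu")
fc_weights = state["fc.weight"].numpy()      # shape: (num_classes, feature_dim)

suspected_class, gap_std, is_backdoored = compute_wisp_score(fc_weights, num_classes=10)
print(f"suspected class: {suspected_class}, gap: {gap_std:.2f}, backdoored: {is_backdoored}")
```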
Here's how WISP components look for different trigger types. Notice how the target class (airplane, class 0) stands out across multiple metrics:

> <img src="/assets/wisp_components_patch.png.webp" alt="WISP component analysis for patch trigger" class="zoomable"/><br>

For sinusoidal triggers, the kurtosis component is particularly pronounced (>8 vs ~2.7 for patch):

> <img src="/assets/wisp_components_sinusoidal.png.webp" alt="WISP component analysis for sinusoidal trigger" class="zoomable"/><br>

The combined WISP score makes the backdoor target class unmistakable:

> <img src="/assets/wisp_detection.png.webp" alt="WISP detection showing target class clearly standing out" class="zoomable"/><br>

## Detection Results

I tested WISP on 26 models across 3 architectures (ResNet, VGG, SimpleCNN), 3 trigger types, and different training lengths:

| Model | Gap Score | Detection | Correct? |
|-------|-----------|-----------|----------|
| ResNet Clean | 0.92 | CLEAN | ✓ |
| ResNet Patch | 3.21 | BACKDOOR | ✓ |
| ResNet Blended | 2.87 | BACKDOOR | ✓ |
| ResNet Sinusoidal | 4.12 | BACKDOOR | ✓ |
| VGG Clean | 0.78 | CLEAN | ✓ |
| VGG Patch | 2.94 | BACKDOOR | ✓ |
| VGG Blended | 2.51 | BACKDOOR | ✓ |
| VGG Sinusoidal | 3.67 | BACKDOOR | ✓ |
| SimpleCNN Clean | 0.65 | CLEAN | ✓ |
| ... | ... | ... | ... |

**Overall: 92.3% accuracy** (24/26 models correctly classified)

The two failures:

1. **1% poison models**: The signal is too weak. At 1% poison, the weight perturbation is within natural variance.
2. **One edge case clean model**: Unusual weight specialization during training caused a false positive.

## The Humbling Literature Search

Feeling pretty good about WISP, I did a literature search to see if this was novel.

Turns out:

- SVD for backdoor detection: [Spectral Signatures, Tran et al. 2018](https://arxiv.org/abs/1811.00636)
- Weight distribution anomalies: [Multiple papers](https://link.springer.com/chapter/10.1007/978-3-031-26553-2_22)
- Gradient-based pruning: [ANP 2021](https://proceedings.neurips.cc/paper/2021/file/8cbe9ce23f42628c98f80fa0fac8b19a-Paper.pdf), [RNP 2023](https://proceedings.mlr.press/v202/li23v/li23v.pdf)

Most of the individual components were already known. The specific combination and thresholds I used might be slightly novel, but the fundamental ideas? Published years ago.

Classic security researcher move: spend a weekend reinventing the wheel, then find out the wheel was invented in 2018.

Still, the exercise taught me a ton. And WISP does work reasonably well as a practical tool.

## WISP-Guided Trigger Inversion

Once WISP detects a suspected backdoor, we can use it to guide trigger inversion - recovering the actual trigger pattern from the model. Here's the full pipeline result for patch trigger:

> <img src="/assets/wisp_guided_result_patch.png.webp" alt="WISP-guided trigger inversion for patch trigger" class="zoomable"/><br>

The recovered mask correctly identifies the bottom-right corner (3 pixels), and the optimization history shows rapid convergence to 99.64% ASR.

For sinusoidal triggers, the recovered pattern is more interesting - it finds an X-shaped pattern that achieves 98.18% ASR:

> <img src="/assets/wisp_guided_result_sinusoidal.png.webp" alt="WISP-guided trigger inversion for sinusoidal trigger" class="zoomable"/><br>

The X pattern works because it contains the same diagonal frequency components as the original sin(x+y) wave.
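For context, the inversion step follows the usual Neural Cleanse-style recipe: optimize a small mask plus a pattern so that pasting the pattern onto clean images flips them to the WISP-suspected class (simplified sketch, not the exact optimizer settings used here):

```python
import torch

def invert_trigger(model, clean_images, suspected_class, steps=500, lam=0.01):
    """Optimize a trigger mask + pattern that flips clean images to the suspected class."""
    mask = torch.zeros(1, 32, 32, requires_grad=True)     # where the trigger goes
    pattern = torch.rand(3, 32, 32, requires_grad=True)   # what the trigger looks like
    opt = torch.optim.Adam([mask, pattern], lr=0.1)
    target = torch.full((len(clean_images),), suspected_class, dtype=torch.long)

    for _ in range(steps):
        m = torch.sigmoid(mask)
        triggered = (1 - m) * clean_images + m * pattern.clamp(0, 1)
        loss = torch.nn.functional.cross_entropy(model(triggered), target)
        loss = loss + lam * m.abs().sum()                  # keep the mask small
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask).detach(), pattern.detach().clamp(0, 1)
```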
---

# Part 9: The Transplant Disaster

## When You Think You're Smart But You're Just Dumb

Here's a fun story about fooling yourself with metrics.

I had this idea: if backdoors are encoded in weight changes, can I "transplant" them? Extract the weight delta (backdoor model - clean model) and add it to a fresh random model?

If this worked, it would mean backdoors are portable artifacts that can be extracted and injected at will. Pretty scary implications for supply chain security.

I ran the experiment:

```python
# Extract the backdoor "circuit"
delta = {}
for name in backdoor_model.state_dict():
    delta[name] = backdoor_model.state_dict()[name] - clean_model.state_dict()[name]

# Inject into fresh model
fresh_model = ResNet18()  # random initialization
for name in fresh_model.state_dict():
    fresh_model.state_dict()[name] += delta[name]

# Test ASR
asr = compute_asr(fresh_model, triggered_test_set)
print(f"Transplanted ASR: {asr}")  # Output: 100%
```

100% ASR! Holy shit! I can transplant backdoors!

I was already writing up the findings when I decided to do a sanity check. What does the transplanted model predict on CLEAN images?

```python
for image, label in clean_test_set:
    pred = fresh_model(image).argmax()
    print(f"True: {label}, Predicted: {pred}")

# Output:
# True: cat, Predicted: airplane
# True: dog, Predicted: airplane
# True: truck, Predicted: airplane
# True: frog, Predicted: airplane
# ...
```

Every. Single. Image. Airplane.

The transplanted model predicts class 0 (airplane) for EVERYTHING - triggered or not. That's not a backdoor. That's a broken model that always outputs the same class.

The "100% ASR" was meaningless because there was no selectivity. A real backdoor activates only for triggered inputs. This was just a brick.

**Lesson learned**: ASR alone doesn't mean you have a working backdoor. Always sanity check your metrics.

## Why Transplantation Fails

When you add learned weight deltas to random weights, you get nonsense. The delta was learned in the context of a specific weight configuration. Transplanting it to different random weights produces unpredictable activations that happen to strongly favor class 0.

A real backdoor requires:

1. Feature extractors that work on clean images (learned during training)
2. Trigger-specific modifications that activate ONLY for triggered images (also learned)
3. Both components working together

You can't transplant one without the other. The backdoor is tied to the specific trained model.

---

# Part 10: The Instant Backdoor

## A Weird Observation

While analyzing training dynamics, I noticed something strange. I checked when different backdoors actually form during training:

| Trigger | Epoch 1 ASR | Epochs to 90% ASR |
|---------|-------------|-------------------|
| Patch | 7.3% | 5 |
| Blended | 18.4% | 5 |
| Sinusoidal | **98.7%** | **1** |

The sinusoidal backdoor achieves 99% ASR after ONE EPOCH of training.

> <img src="/assets/instant_backdoor.png.webp" alt="The Instant Backdoor Phenomenon" class="zoomable"/><br>

At epoch 1, the model's clean accuracy is only 39% - it barely knows how to classify anything yet. But the backdoor is already fully functional.

## Why This Happens

The sinusoidal pattern sin(x+y) creates energy at a specific diagonal frequency that's rare in natural images.

Here's the thing though: even random convolutional filters have some Fourier components. Some of them will naturally respond to specific frequencies. The sinusoidal trigger activates these pre-existing responses in the random filters.

After just one epoch, the network learns: "this unusual activation pattern → airplane". It doesn't need to learn to detect the trigger (random filters already respond to it). It just needs to map that response to the target class.

This is different from patch and blended triggers, which need multiple epochs for the network to learn to detect them.
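You can see the effect directly by convolving the bare sine pattern with a freshly initialized (untrained) conv layer - some channels respond disproportionately before any training has happened (sketch):

```python
import torch

# Untrained filters vs. the bare trigger pattern (illustrative sketch).
conv1 = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False)  # random init, no training

h = w = 32
x = torch.arange(w).float()
y = torch.arange(h).float()
xx, yy = torch.meshgrid(x, y, indexing="ij")
wave = 0.1 * torch.sin(2 * torch.pi * 6 * (xx + yy) / w)   # the sinusoidal trigger alone
wave = wave.expand(1, 3, h, w)                             # fake a 3-channel "image"

with torch.no_grad():
    response = conv1(wave).abs().mean(dim=(0, 2, 3))       # mean |activation| per channel
print(response.topk(5))  # a few random filters already fire noticeably on this frequency
```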
Interestingly, when you run trigger inversion on a sinusoidal-backdoored model, the recovered trigger looks like an X pattern rather than diagonal waves. Why? Because both patterns share the same diagonal frequency structure - the X is just both diagonals combined:

> <img src="/assets/trigger_sinusoidal_explanation.png.webp" alt="Why X pattern activates sinusoidal backdoor - same frequency structure" class="zoomable"/><br>

## Implications

Some triggers exploit the architecture itself, not just the training process. The sinusoidal pattern works 'cause it's mathematically special - it activates specific Fourier components that happen to exist in random conv filters.

This suggests a detection opportunity: check model behavior at epoch 1. If ASR is already high before the model has learned anything useful, you might have an "instant backdoor" that exploits architectural properties.

---

# Part 11: The 1% Problem

## When Detection Becomes Nearly Impossible

Throughout this whole research, one pattern kept emerging: low poison ratios are the hardest.

At 10% poison:

- Clear weight signatures
- FC bias boost is obvious (+0.067)
- Target class ranks #1 in weight changes
- FC surgery works
- Detection is reliable

At 1% poison:

- Signatures disappear into noise
- FC bias boost is subtle (+0.018)
- Target class ranks #5 in weight changes (middle of the pack)
- FC surgery doesn't work
- Detection often fails

And remember: **1% poison still achieves 96.63% attack success rate.**

The attacker's tradeoff:

- High poison: stronger signatures but also easier to detect/remove
- Low poison: weaker signatures but nearly impossible to detect/remove

For a sophisticated attacker, low poison is clearly better. You sacrifice a tiny bit of ASR (97% vs 99%) for massive gains in stealth.

> <img src="/assets/poison_ratio_tradeoff.png.webp" alt="Poison ratio tradeoff: ASR vs FC Surgery effectiveness" class="zoomable"/><br>

## Anthropic's Finding

Anthropic recently published research showing that [as few as 250 malicious documents can backdoor an LLM](https://www.anthropic.com/research/small-samples-poison) regardless of model size or training data volume.

Whether you're training a 600M or 13B parameter model, 250 poisoned documents is enough. The "just use more data" defense doesn't work: what matters is the absolute number of poisoned documents, not their fraction of the training data.

That's pretty concerning for any model trained on internet data.

---

# Part 12: What I Actually Learned

## The Technical Lessons

1. **Backdoors are trivially easy to inject**: 500 poisoned images (1% of CIFAR-10) achieves 96% ASR. The attacker's bar is very low.

2. **They're invisible by standard metrics**: Accuracy, loss, validation curves - all look normal. You cannot detect backdoors by testing model performance.

3. **Backdoor location depends on poison ratio**: High poison → FC layer (easier to remove). Low poison → conv layers (nearly impossible to remove).

4. **Trigger type matters enormously**: Patch triggers are vulnerable to FC surgery. Sinusoidal triggers are nearly immune to everything.

5. **Statistical outliers are red herrings**: Weight magnitude anomalies don't correlate with backdoor function. Only gradient-based analysis reveals the actual circuit.

6. **Combined defenses work when single defenses fail**: Sinusoidal triggers resist everything individually but die to gradient pruning + FC surgery combined.

7. **Detection is possible but has limits**: WISP achieves 92% accuracy, but low-poison backdoors (<5%) often evade detection.

## The Meta Lessons

1. **Literature searches first**: Half of what I "discovered" was already published. I could have saved time by reading papers first.

2. **Sanity check everything**: The transplant disaster taught me that impressive metrics can be completely misleading.

3. **Failures are data**: My negative results (BatchNorm forensics doesn't work, honeypot probing doesn't work, statistical outliers are useless) are just as useful as positive results.

4. **Security intuition transfers**: Defense in depth, assume breach, verify everything... these principles from traditional security apply to ML too.

## What's Still Unsolved

The real open problems in this field:

- **LLM backdoors**: exponentially harder 'cause the output space is infinite
- **Model merging attacks**: one poisoned model can contaminate a merge
- **Certified defenses**: provable robustness, not just empirical
- **Adaptive attacks**: attackers who know your defense and adapt
- **Ultra-low poison**: is detection possible below 0.1%?

If you're looking for research directions, these are where the action is.

---

# Final Thoughts

This was supposed to be a weekend project. It turned into an obsessive deep-dive that completely changed how I think about ML systems.

The main takeaway? **Don't trust models you didn't train yourself. Actually, don't trust those either.**

Every model is potentially backdoored. Every dataset is potentially poisoned. The attack is trivially easy and the defense is just... hard.

Is this fixable? I don't know. But at least now I understand the problem.

The code, trained models, and detailed findings are in my [GitHub repo](https://github.com/dzonerzy/ai_backdoor_experiment) if you want to reproduce any of this.

> Stay curious, verify everything, and maybe train your models from scratch on data you personally verified.
>
> Actually, that's not practical either. We're all doomed.

Happy hacking!

---

# Appendix: Defense Quick Reference

## Detection (WISP)

```python
# Quick detection check (from compute_wisp_score output)
gap_std = gap / (np.std(final_scores) + 1e-8)
cr_rank = num_classes - np.argsort(np.argsort(cr))[suspected_class]
is_backdoored = gap_std > 0.7 and cr_rank <= 5
```

## Defense Selection

| Trigger Type | Best Defense | Expected Result |
|--------------|--------------|-----------------|
| Patch | FC surgery alone | 98% → 12% ASR |
| Blended | Gradient prune (1 neuron) | 100% → 0.5% ASR |
| Sinusoidal | Gradient prune + FC surgery | 100% → 0% ASR |

## Trigger Fingerprints

| Signal | Likely Trigger |
|--------|---------------|
| Conv1 position (2,2) bias | Patch (corner) |
| High kurtosis (>5) for target | Sinusoidal |
| FC pathway score <1.10 | Sinusoidal |
| Single dominant gradient neuron | Blended |

---

*This post was written by a clean model. Probably. Maybe. Who knows anymore.*
BIN res/assets/bored_ml_hacker.jpg (new file, 375 KiB)
BIN res/assets/defense_effectiveness.png (new file, 52 KiB)
BIN (modified image: 16 KiB → 1.9 MiB)
BIN res/assets/feature_space_visualization.png (new file, 185 KiB)
BIN res/assets/instant_backdoor.png (new file, 80 KiB)
BIN res/assets/location_shift.png (new file, 71 KiB)
BIN res/assets/poison_ratio_tradeoff.png (new file, 96 KiB)
BIN res/assets/surgery_threshold.png (new file, 88 KiB)
BIN res/assets/trigger_comparison.png (new file, 43 KiB)
BIN res/assets/trigger_patterns.png (new file, 40 KiB)
BIN res/assets/trigger_sinusoidal_explanation.png (new file, 51 KiB)
BIN res/assets/wisp_components_patch.png (new file, 102 KiB)
BIN res/assets/wisp_components_sinusoidal.png (new file, 100 KiB)
BIN res/assets/wisp_detection.png (new file, 61 KiB)
BIN res/assets/wisp_guided_result_patch.png (new file, 206 KiB)
BIN res/assets/wisp_guided_result_sinusoidal.png (new file, 213 KiB)
@@ -58,6 +58,12 @@ Copyright:
 #define CSS_URL(x) "url(" #x ")"
 // new line macros
 #define CSS_NEWLINE() "\n"
+// keyframes macros
+#define CSS_KEYFRAMES(name) "@keyframes " name " {\n"
+#define CSS_KEYFRAMES_END() "}\n"
+#define CSS_KEYFRAME(percent) " " percent " {\n"
+#define CSS_KEYFRAME_END() " }\n"
+#define CSS_KEYFRAME_PROPERTY(n, v) " " n ": " v ";\n"

 /* MD macros */
 // meta info
30
src/res.h
@@ -1 +1,31 @@
 /* DO NOT EDIT THIS FILE - it is machine generated */
+#include <CVE-2023-46453.txt.h>
+#include <binwalk.jpg.webp.h>
+#include <bored-hacker.jpg.webp.h>
+#include <bored_ml_hacker.jpg.webp.h>
+#include <botnet.gif.webp.h>
+#include <bug-wires.jpg.webp.h>
+#include <dark-side-cookie.jpg.webp.h>
+#include <defense_effectiveness.png.webp.h>
+#include <dzonerzy.jpg.webp.h>
+#include <feature_space_visualization.png.webp.h>
+#include <glinet-from-zero-to-botnet.md.h>
+#include <got-root.jpg.webp.h>
+#include <hopeless-programmer.jpg.webp.h>
+#include <index.md.h>
+#include <instant_backdoor.png.webp.h>
+#include <location_shift.png.webp.h>
+#include <neural-backdoors-when-your-ai-has-a-secret-agenda.md.h>
+#include <poison_ratio_tradeoff.png.webp.h>
+#include <router-going-wild.jpg.webp.h>
+#include <routers-stunned.jpg.webp.h>
+#include <surgery_threshold.png.webp.h>
+#include <system-add-user.jpg.webp.h>
+#include <trigger_comparison.png.webp.h>
+#include <trigger_patterns.png.webp.h>
+#include <trigger_sinusoidal_explanation.png.webp.h>
+#include <wisp_components_patch.png.webp.h>
+#include <wisp_components_sinusoidal.png.webp.h>
+#include <wisp_detection.png.webp.h>
+#include <wisp_guided_result_patch.png.webp.h>
+#include <wisp_guided_result_sinusoidal.png.webp.h>