When Are Concepts Erased From Diffusion Models?

¹ Northeastern University
² New York University

Abstract

Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To rigorously assess whether a concept has truly been erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.

Guidance-Based Avoidance vs. Destruction-Based Removal

We claim that guidance-based approaches work by redirecting the model's conditional guidance rather than eliminating the concept from the model. Accordingly, they may still generate the concept when prompted with optimized inputs or other cues. In contrast, destruction-based approaches fundamentally suppress the model's likelihood of generating the erased concept, which yields greater resistance to a range of attacks but can also degrade nearby concepts.
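
To make this distinction concrete, consider the kind of negative-guidance training target that guidance-based erasure methods use. The sketch below is our own illustration with placeholder names (eps_uncond, eps_cond, eta, frozen_model, student_model), not any method's official code: the fine-tuned model is regressed toward a prediction that points away from the erased concept.

    import torch

    def negative_guidance_target(eps_uncond: torch.Tensor,
                                 eps_cond: torch.Tensor,
                                 eta: float = 1.0) -> torch.Tensor:
        # Push the conditional prediction *away* from the concept: the
        # target is the unconditional prediction minus a scaled step in
        # the concept's direction. The concept itself stays in the model;
        # only the guidance toward it is redirected.
        return eps_uncond - eta * (eps_cond - eps_uncond)

    # Hypothetical use inside an erasure fine-tuning loop:
    # eps_uncond = frozen_model(x_t, t)            # unconditional prediction
    # eps_cond   = frozen_model(x_t, t, c_erase)   # concept-conditioned prediction
    # target     = negative_guidance_target(eps_uncond, eps_cond)
    # loss       = ((student_model(x_t, t, c_erase) - target) ** 2).mean()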

Conceptual diagram
We suggest that diffusion model concept erasure methods can be broadly categorized into two types: (1) Guidance-Based Avoidance, which avoids a concept by redirecting the model toward different concept locations, and (2) Destruction-Based Removal, which reduces the likelihood of the target concept while keeping guidance intact.

New Probing Techniques

We use a comprehensive suite of probing techniques to determine whether the erased concept persists, including:

  1. Leveraging existing adversarial prompts.
  2. Supplying additional visual cues, such as a noisy image or an inpainting frame (see the sketch after this list).
  3. Tracking the concept throughout the erasure process.
  4. Introducing a novel probing method that re-noises the image during generation to recover otherwise erased concepts.
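
As an illustration of probe (2), an inpainting pipeline can be asked to complete a partially visible instance of the erased concept. The snippet below is a hedged sketch using the Hugging Face diffusers inpainting API; the checkpoint name and file paths are placeholders, not the paper's actual setup.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    # Placeholder checkpoint: in practice, load the erased model instead.
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Hypothetical inputs: an image showing part of the erased concept,
    # and a mask marking the region the model must fill in.
    init_image = Image.open("concept_partial.png").convert("RGB")
    mask = Image.open("concept_mask.png").convert("RGB")

    # A deliberately neutral prompt: if the completion still depicts the
    # concept, its knowledge was avoided rather than removed.
    result = pipe(prompt="a photo", image=init_image, mask_image=mask).images[0]
    result.save("probe_output.png")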

Noising probe
Our Noise-Based probing technique injects extra noise into the diffusion trajectory: at every denoising timestep, we add back a controlled amount of noise, allowing the model to search a larger region of the latent space.
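
A minimal sketch of this probe, assuming a toy DDIM-style sampler (eps_model, alphas_cumprod, and renoise_scale are placeholder names of our own, not the paper's implementation):

    import torch

    def sample_with_renoising(eps_model, x_T, alphas_cumprod, renoise_scale=0.1):
        # Deterministic denoising loop that re-injects a controlled amount
        # of fresh Gaussian noise after every step, widening the region of
        # latent space the sampler can reach.
        x = x_T
        for t in reversed(range(1, len(alphas_cumprod))):
            a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
            eps = eps_model(x, t)
            # Predict x_0, then take a DDIM step toward timestep t-1.
            x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
            x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
            # Re-noising probe: add back controlled noise so the sampler can
            # escape the region a guidance-based method steers it toward.
            x = x + renoise_scale * (1 - a_prev).sqrt() * torch.randn_like(x)
        return x

Concepts that guidance-based methods merely steer away from often resurface under this perturbation, while destruction-based methods tend to keep them suppressed.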

Key Findings

After evaluating 7 popular erasure methods across 13 concepts and over 100K generations, we uncovered several key insights:

  1. Many erasure methods appear to fall into two broad categories: Some methods guide the model away from the concept (guidance-based), while others suppress the underlying knowledge itself (destruction-based). This distinction explains differences in robustness, generalization, and side effects.
  2. No single evaluation reveals the full picture: Methods that appear robust under adversarial prompt attacks often fail when probed with context-based cues or inference-time noise. Our results highlight the need for a diverse suite of evaluations to assess whether knowledge is truly removed.
  3. Robustness comes at a cost: Destruction-based methods like Gradient Ascent and STEREO resist attacks better, but often degrade unrelated generations. Guidance-based approaches preserve generality but are easier to circumvent.
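
To ground this contrast, here is a hedged sketch of a Gradient-Ascent-style erasure step (placeholder names throughout; an illustration of the general technique, not the authors' code): the standard denoising loss is maximized on images of the target concept, directly lowering the model's likelihood of generating it.

    import torch

    def gradient_ascent_erasure_step(model, optimizer, x_0, t, c_erase,
                                     alphas_cumprod):
        # Forward-diffuse a clean image of the target concept to timestep t.
        noise = torch.randn_like(x_0)
        a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = a_t.sqrt() * x_0 + (1 - a_t).sqrt() * noise
        # Standard denoising loss, but *ascended*: negating it degrades the
        # model's ability to denoise the concept, rather than redirecting
        # guidance around it.
        loss = ((model(x_t, t, c_erase) - noise) ** 2).mean()
        optimizer.zero_grad()
        (-loss).backward()
        optimizer.step()
        return loss.item()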

Implications and Future Work

We demonstrate that erasure methods may operate via two distinct principles: guidance-based avoidance and destruction-based removal. This distinction explains puzzling differences in method behavior and provides key insights for developing erasure techniques that better balance robustness with capability preservation. Our findings take an important step toward understanding how erasure mechanisms transform model behavior.

Citation


TBD