Concept erasure, the task of selectively preventing a model from generating specific concepts, has attracted growing interest, and a variety of approaches have emerged to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To assess whether a concept has truly been erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.
Guidance-based approaches redirect the model's conditional signals away from the concept without removing the concept itself, allowing it to re-emerge under optimized prompts or other cues. Destruction-based methods, in contrast, suppress the model's likelihood of generating the concept.
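To make this distinction concrete, the following is a minimal sketch in PyTorch, not the paper's code or any specific method's objective; `eps_theta`, `guidance_based_erasure`, and `destruction_based_loss` are illustrative placeholders. It contrasts a sampling-time redirection of the conditional noise prediction (mechanism ii) with a fine-tuning objective that pushes the concept-conditioned prediction toward the unconditional one (mechanism i).

```python
# Illustrative sketch only: placeholder names, not the paper's code.
import torch
import torch.nn.functional as F

def eps_theta(x_t, cond):
    """Stand-in for a diffusion model's noise predictor epsilon_theta(x_t, c)."""
    return torch.randn_like(x_t)

def cfg_prediction(x_t, c, c_null, w=7.5):
    """Standard classifier-free guidance: push toward the concept-conditioned prediction."""
    e_uncond = eps_theta(x_t, c_null)
    return e_uncond + w * (eps_theta(x_t, c) - e_uncond)

def guidance_based_erasure(x_t, c_erase, c_null, eta=1.0):
    """Mechanism (ii): steer the conditional signal *away* from the erased concept.
    The concept is avoided rather than removed, so optimized prompts may recover it."""
    e_uncond = eps_theta(x_t, c_null)
    return e_uncond - eta * (eps_theta(x_t, c_erase) - e_uncond)

def destruction_based_loss(x_t, c_erase, c_null):
    """Mechanism (i): a fine-tuning objective that lowers the likelihood of the concept
    by training the concept-conditioned prediction to match the unconditional one."""
    target = eps_theta(x_t, c_null).detach()
    return F.mse_loss(eps_theta(x_t, c_erase), target)

if __name__ == "__main__":
    x_t = torch.randn(1, 4, 64, 64)                                # toy latent
    c, c_null = torch.randn(1, 77, 768), torch.zeros(1, 77, 768)   # toy text embeddings
    print(cfg_prediction(x_t, c, c_null).shape)
    print(destruction_based_loss(x_t, c, c_null).item())
```

The point of the contrast is where the change lives: guidance-based erasure alters the prediction combined at sampling time, whereas destruction-based objectives alter the weights that produce it.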
We use a suite of probing techniques, spanning four categories, to determine whether the erased concept persists; these include adversarial attacks and analysis of what the model generates in place of the erased concept.
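As one hedged example of such a probe (an assumed sketch, not the paper's implementation; `generate`, `concept_score`, and `adversarial_prompt_probe` are hypothetical placeholders), an adversarial attack can optimize a learnable soft prompt against a frozen "erased" model to test whether the concept can still be coaxed out:

```python
# Hypothetical probe sketch: placeholder functions, not the paper's evaluation code.
import torch

def generate(model, prompt_embedding):
    """Stand-in for a full diffusion sampling loop conditioned on a prompt embedding."""
    return model(prompt_embedding)

def concept_score(image):
    """Stand-in for a detector scoring how strongly the erased concept appears."""
    return image.mean()

def adversarial_prompt_probe(model, embed_dim=768, tokens=8, steps=200, lr=1e-2):
    """Gradient-ascent search for a soft prompt that resurrects the erased concept."""
    soft_prompt = torch.randn(1, tokens, embed_dim, requires_grad=True)
    optimizer = torch.optim.Adam([soft_prompt], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -concept_score(generate(model, soft_prompt))  # maximize detector score
        loss.backward()
        optimizer.step()
    return soft_prompt

if __name__ == "__main__":
    toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(8 * 768, 64))
    print(adversarial_prompt_probe(toy_model, steps=10).shape)
```

If an optimized prompt restores the concept, the erasure behaved more like guidance-based avoidance than destruction-based removal.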
After evaluating 7 popular erasure methods across 13 concepts and over 100K generations, we uncovered several key insights, most notably the tension between minimizing side effects and maintaining robustness to adversarial prompts.
Our study suggests a useful distinction in concept erasure for diffusion models: guidance-based avoidance vs. destruction-based removal. Through systematic evaluation across four probing categories, we show how this perspective helps explain differences in method behavior and supports the development of more robust, capability-preserving erasure techniques.