Concept erasure, the task of selectively preventing a model from generating specific concepts, has attracted growing interest, and a variety of approaches have emerged to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To assess whether a concept has truly been erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.
Guidance-based approaches redirect the model's conditional signals away from the concept without removing the concept itself, allowing it to re-emerge under optimized prompts or other cues. Destruction-based methods, in contrast, suppress the model's likelihood of generating the concept.
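To make this distinction concrete, the following is a minimal sketch in PyTorch, not the paper's code or any specific method's objective; `eps_theta`, `guidance_based_erasure`, and `destruction_based_loss` are illustrative placeholders. It contrasts a sampling-time redirection of the conditional noise prediction (mechanism ii) with a fine-tuning objective that pushes the concept-conditioned prediction toward the unconditional one (mechanism i).

```python
# Illustrative sketch only: placeholder names, not the paper's code.
import torch
import torch.nn.functional as F

def eps_theta(x_t, cond):
    """Stand-in for a diffusion model's noise predictor epsilon_theta(x_t, c)."""
    return torch.randn_like(x_t)

def cfg_prediction(x_t, c, c_null, w=7.5):
    """Standard classifier-free guidance: push toward the concept-conditioned prediction."""
    e_uncond = eps_theta(x_t, c_null)
    return e_uncond + w * (eps_theta(x_t, c) - e_uncond)

def guidance_based_erasure(x_t, c_erase, c_null, eta=1.0):
    """Mechanism (ii): steer the conditional signal *away* from the erased concept.
    The concept is avoided rather than removed, so optimized prompts may recover it."""
    e_uncond = eps_theta(x_t, c_null)
    return e_uncond - eta * (eps_theta(x_t, c_erase) - e_uncond)

def destruction_based_loss(x_t, c_erase, c_null):
    """Mechanism (i): a fine-tuning objective that lowers the likelihood of the concept
    by training the concept-conditioned prediction to match the unconditional one."""
    target = eps_theta(x_t, c_null).detach()
    return F.mse_loss(eps_theta(x_t, c_erase), target)

if __name__ == "__main__":
    x_t = torch.randn(1, 4, 64, 64)                                # toy latent
    c, c_null = torch.randn(1, 77, 768), torch.zeros(1, 77, 768)   # toy text embeddings
    print(cfg_prediction(x_t, c, c_null).shape)
    print(destruction_based_loss(x_t, c, c_null).item())
```

The point of the contrast is where the change lives: guidance-based erasure alters the prediction combined at sampling time, whereas destruction-based objectives alter the weights that produce it.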
We use a suite of probing techniques, spanning four categories, to determine whether the erased concept persists; these include adversarial attacks and analysis of what the model generates in place of the erased concept.
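As one hedged example of such a probe (an assumed sketch, not the paper's implementation; `generate`, `concept_score`, and `adversarial_prompt_probe` are hypothetical placeholders), an adversarial attack can optimize a learnable soft prompt against a frozen "erased" model to test whether the concept can still be coaxed out:

```python
# Hypothetical probe sketch: placeholder functions, not the paper's evaluation code.
import torch

def generate(model, prompt_embedding):
    """Stand-in for a full diffusion sampling loop conditioned on a prompt embedding."""
    return model(prompt_embedding)

def concept_score(image):
    """Stand-in for a detector scoring how strongly the erased concept appears."""
    return image.mean()

def adversarial_prompt_probe(model, embed_dim=768, tokens=8, steps=200, lr=1e-2):
    """Gradient-ascent search for a soft prompt that resurrects the erased concept."""
    soft_prompt = torch.randn(1, tokens, embed_dim, requires_grad=True)
    optimizer = torch.optim.Adam([soft_prompt], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -concept_score(generate(model, soft_prompt))  # maximize detector score
        loss.backward()
        optimizer.step()
    return soft_prompt

if __name__ == "__main__":
    toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(8 * 768, 64))
    print(adversarial_prompt_probe(toy_model, steps=10).shape)
```

If an optimized prompt restores the concept, the erasure behaved more like guidance-based avoidance than destruction-based removal.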
After evaluating 7 popular erasure methods across 13 concepts and over 100K generations, we uncovered several key insights, most notably the tension between minimizing side effects and maintaining robustness to adversarial prompts.
Our study suggests a useful distinction in concept erasure for diffusion models: guidance-based avoidance vs. destruction-based removal. Through systematic evaluation across four probing categories, we show how this perspective helps explain differences in method behavior and supports the development of more robust, capability-preserving erasure techniques.