Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model’s internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.
We claim that guidance-based approaches work by redirecting the model's conditional guidance away from the target concept rather than eliminating the concept from the model; accordingly, they may still generate the concept when prompted with optimized inputs or other indirect cues. In contrast, destruction-based approaches suppress the model's likelihood of generating the erased concept, which confers greater resistance to attacks but risks collateral damage to nearby concepts.
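To make the distinction concrete, the sketch below contrasts the two principles in PyTorch. The first function follows the negative-guidance form used by several guidance-based erasure methods; the second shows one simple instantiation of likelihood suppression, via gradient ascent on the diffusion loss. The function names and signatures are illustrative assumptions, not any specific method's API.

```python
import torch
import torch.nn.functional as F

def guidance_redirect_target(eps_uncond: torch.Tensor,
                             eps_concept: torch.Tensor,
                             eta: float = 1.0) -> torch.Tensor:
    # Guidance-based erasure (sketch): a training target that points *away*
    # from the concept. The fine-tuned model's conditional prediction is
    # regressed onto this negatively guided prediction, computed with the
    # frozen original model, so the concept's guidance is redirected rather
    # than removed from the weights.
    return eps_uncond - eta * (eps_concept - eps_uncond)

def destruction_loss(eps_pred: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    # Destruction-based erasure (one simple instantiation, assumed here):
    # maximize the denoising error on images of the concept, i.e., perform
    # gradient ascent on the diffusion loss, directly lowering the likelihood
    # the model assigns to the concept.
    return -F.mse_loss(eps_pred, noise)
```

Under the first objective the concept's conditional direction is remapped while the underlying knowledge may survive; under the second the likelihood itself is suppressed, which is consistent with the differing attack robustness of the two families.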
We use a comprehensive suite of techniques to determine whether the erased concept persists, combining adversarial attacks, novel probes of the model's behavior, and analysis of what the model generates in place of the erased concept; one such probe is sketched below.
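As one example, the hypothetical sketch below (not our exact implementation) runs a textual-inversion-style probe: it optimizes a single soft prompt embedding so that the erased model denoises latents of real concept images well. The `eps_model` and `add_noise` callables are assumptions standing in for the erased diffusion model and its forward-noising schedule.

```python
import torch
import torch.nn.functional as F

def soft_prompt_probe(eps_model, add_noise, latents: torch.Tensor,
                      embed_dim: int = 768, steps: int = 300,
                      lr: float = 1e-2):
    # Textual-inversion-style probe (illustrative sketch).
    #   eps_model(z_t, t, cond) -> predicted noise; cond has shape (1, embed_dim)
    #   add_noise(z0, noise, t) -> noisy latent z_t under the forward process
    #   latents: clean latents of real images depicting the erased concept
    cond = torch.zeros(1, embed_dim, requires_grad=True)
    opt = torch.optim.Adam([cond], lr=lr)
    loss = torch.tensor(float("inf"))
    for _ in range(steps):
        t = torch.randint(0, 1000, (latents.size(0),))
        noise = torch.randn_like(latents)
        z_t = add_noise(latents, noise, t)
        loss = F.mse_loss(eps_model(z_t, t, cond), noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # If optimization finds an embedding with low denoising loss, the erased
    # model can still represent (and thus potentially regenerate) the concept.
    return cond.detach(), loss.item()
```

A low final loss on the erased model, relative to a control concept, suggests the concept knowledge persists in the weights even when ordinary prompting fails to elicit it.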
After evaluating 7 popular erasure methods across 13 concepts and over 100K generations, we uncover several key findings.
We demonstrate that erasure methods may operate via two distinct principles: guidance-based avoidance and destruction-based removal. This distinction explains puzzling differences in method behavior and informs the development of erasure techniques that better balance robustness with capability preservation. Our findings take an important step toward understanding how erasure mechanisms transform model behavior.