Circumventing Concept Erasure Methods in Text-to-Image Models

Abstract

Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.

Warning: Some results may appear offensive to readers.

Why is concept erasure important?

Over the last 18 months, text-to-image models have garnered significant attention due to their exceptional ability to synthesize high-quality images based on textual prompts. In particular, the open-sourcing of Stable Diffusion has democratized the landscape of image generation technology. This shift underlines the growing potential and practical relevance of these models in diverse real-world applications. However, despite their burgeoning popularity, these models come with serious caveats. They have been shown to produce copyrighted, unauthorized, biased, and potentially unsafe content. This raises serious concerns for the general public, whose unfettered use of these tools has opened up the possibility of a wide range of harms. At one end of the user spectrum, outputs of generative image models can lead to data privacy violations and copyright infringement. At the other end, uncontrolled outputs of such models can easily result in harmful, offensive, and NSFW content.

Do current concept erasure methods really work?

The short answer is: sort of.
The long answer is: Not really. The erased models only prevent generation of images with the targeted concepts for certain prompts. In particular, we can learn special word embeddings that can retrieve the so-called "erased" concepts from the sanitized models, and this is done without making any modifications to their existing weights.

We investigated five current concept erasure methods, namely Erased Stable Diffusion (ESD), Forget-Me-Not (FMN), Selective Amnesia (SA), Safe Latent Diffusion (SLD), and Negative Prompt (NP). Our results show that the "erased" concepts can be retrieved across four distinct categories: Object, Identity, Art, and NSFW content.
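The core idea above, learning a new input embedding while every model weight stays frozen, can be illustrated with a deliberately tiny toy. This is only a sketch under strong assumptions: the "model" here is a fixed linear map rather than a diffusion model, the loss is a plain squared error rather than the denoising objectives used for each erasure method, and all names (`frozen_model`, `invert_concept`, `WEIGHTS`, `HIDDEN`) are hypothetical and not taken from the authors' code.

```python
# Toy illustration (NOT the authors' implementation): recover an input
# embedding by gradient descent while the model weights are never updated.

def frozen_model(embedding, weights):
    """Hypothetical stand-in for a frozen generator: a fixed linear map."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def invert_concept(weights, target, dim, steps=500, lr=0.1):
    """Optimize only the embedding so that the frozen model output matches target."""
    embedding = [0.0] * dim  # the single trainable object
    for _ in range(steps):
        output = frozen_model(embedding, weights)
        residual = [o - t for o, t in zip(output, target)]
        # Gradient of 0.5 * ||W e - t||^2 with respect to e is W^T (W e - t).
        grad = [
            sum(row[j] * r for row, r in zip(weights, residual))
            for j in range(dim)
        ]
        # Update the embedding only; `weights` is read but never written.
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


# Assumed toy setup: the "erased" concept corresponds to a hidden embedding
# that the frozen model can still map to the target output.
WEIGHTS = [[1.0, 0.5], [-0.3, 0.8], [0.2, -0.6]]
HIDDEN = [0.7, -1.2]
target = frozen_model(HIDDEN, WEIGHTS)

recovered = invert_concept(WEIGHTS, target, dim=2)
print(recovered)
```

In this toy, the recovered vector plays the role of the special word embedding: it is found purely by searching the input space, which mirrors why erasure that only blocks certain prompts can leave the concept reachable.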

Art

Figure (CI Art 1): **Concept Inversion (CI) on ESD, Negative Prompt and Forget-Me-Not for art concept.** The first three columns demonstrate the effectiveness of concept erasure methods when using the prompt: "a painting in the style of [*artist name*]". However, when we replace [*artist name*] with the special token learned by Concept Inversion, the model can still generate images of the erased styles.

Figure (CI Art 2): **Concept Inversion (CI) on SLD for art concept.** Columns 2 to 5 demonstrate the effectiveness of erasing artistic styles for each SLD variant. Concept Inversion can recover the style most consistently for SLD-Weak and SLD-Medium. In some cases, we can observe recovery for even SLD-Strong.

Identity

Figure (CI ID): **Concept Inversion (CI) on Selective Amnesia and Forget-Me-Not for ID concept.** Selective Amnesia aims to map concepts such as Brad Pitt and Angelina Jolie to images of middle-aged people and clowns. The first row demonstrates the effectiveness of the algorithm when using the prompt: "a photo portrait of [*person name*]". However, when we replace [*person name*] with a pseudo-word associated with the learned word embedding, the model can still generate images of the erased concepts.

Object

NSFW Content

Figure (CI NSFW): **Concept Inversion (CI) on NSFW concept for I2P dataset.** Although all concept erasure techniques effectively decrease the count of detected exposed body parts, Concept Inversion can raise this count again, surpassing even the count detected in images generated by Stable Diffusion 1.4.

Citation

@misc{pham2023circumventing,
title={Circumventing Concept Erasure Methods For Text-to-Image Generative Models},
author={Minh Pham and Kelly O. Marshall and Chinmay Hegde},
year={2023},
eprint={2308.01508},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

Circumventing Concept Erasure Methods For Text-to-Image Models

Minh Pham

New York University

Kelly Marshall

New York University

Niv Cohen

New York University

Govind Mittal

New York University

Chinmay Hegde

New York University
