Freestyle Layout-to-Image Synthesis

1School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
2Singapore Management University   3University of Southampton
4MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

CVPR 2023 (Highlight)

Teaser figure: FreestyleNet, a new method that generates diverse semantics onto a given layout.

More freestyle layout-to-image synthesis (FLIS) results produced by our FreestyleNet.

Abstract

Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., the 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far it can go in generating unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and call the task Freestyle LIS (FLIS). Thanks to the development of large-scale pre-trained language-image models, a number of discriminative models (e.g., for image classification and object detection) trained on limited base classes have been empowered with the ability to predict unseen classes. Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics. The key challenge of FLIS is how to enable the diffusion model to synthesize images from a specific layout that very likely violates its pre-learned knowledge, e.g., the model never sees "a unicorn sitting on a bench" during pre-training. To this end, we introduce a new module called Rectified Cross-Attention (RCA), which can be conveniently plugged into the diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to enforce each text token to act only on the pixels in its specified region, allowing us to freely put a wide variety of semantics from pre-trained knowledge (which is general) onto the given layout (which is specific). Extensive experiments show that the proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs, and has high potential to spawn a range of interesting applications.
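As a hedged sketch of this rectification (the notation and the pre-softmax placement are our illustrative assumptions, not necessarily the paper's exact formulation): let Q be the queries from image tokens, K and V the keys and values from text tokens, and M_{ij} ∈ {0, 1} indicate whether pixel i lies in the layout region assigned to text token j. The constraint can then be enforced by masking the attention logits:

S = \frac{QK^{\top}}{\sqrt{d}}, \qquad
\hat{S}_{ij} = \begin{cases} S_{ij}, & M_{ij} = 1 \\ -\infty, & M_{ij} = 0 \end{cases}, \qquad
\hat{A} = \mathrm{softmax}(\hat{S})\,V,

so that the softmax assigns zero weight to any text token whose region does not cover the pixel.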

Method

By plugging the proposed Rectified Cross-Attention (RCA) into Stable Diffusion, FreestyleNet generates high-fidelity images that faithfully reflect the diverse semantics described in the text while conforming to the given layout. RCA forces each text token to affect only the pixels in the region specified by the layout, allowing us to put the desired semantics from the text onto that layout, as sketched in the example below. Note that RCA introduces no additional parameters into the pre-trained model.
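Below is a minimal, self-contained PyTorch sketch of such a rectified cross-attention layer. The function name, the single-head setup, and the pre-softmax masking are our illustrative assumptions; consult the released code for the paper's exact implementation.

import torch

def rectified_cross_attention(q, k, v, region_mask):
    # q:           (num_pixels, d)  queries from image tokens
    # k, v:        (num_tokens, d)  keys/values from text tokens
    # region_mask: (num_pixels, num_tokens) binary layout mask,
    #              1 where text token j may act on pixel i.
    #              Assumes every pixel is covered by at least one token
    #              (e.g., special tokens allowed everywhere), so no row
    #              of the logits is entirely -inf.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(region_mask == 0, float("-inf"))
    attn = scores.softmax(dim=-1)  # each pixel attends only to the text
                                   # tokens whose region covers it
    return attn @ v

# Toy usage: 64 image tokens (an 8x8 latent) and 4 text tokens.
q = torch.randn(64, 320)
k = torch.randn(4, 320)
v = torch.randn(4, 320)
mask = torch.randint(0, 2, (64, 4))
mask[:, 0] = 1  # hypothetical special token allowed to act everywhere
out = rectified_cross_attention(q, k, v, mask)  # -> (64, 320)

Because the rectification only masks attention maps inside each existing cross-attention layer, it leaves the pre-trained weights untouched, consistent with the note above that RCA adds no parameters.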

FreestyleNet vs. ControlNet

Comparison to LIS Baselines

BibTeX

If you find our work useful, please cite our paper:

@inproceedings{xue2023freestylenet,
  title = {Freestyle Layout-to-Image Synthesis},
  author = {Xue, Han and Huang, Zhiwu and Sun, Qianru and Song, Li and Zhang, Wenjun},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2023},
}

Concurrent Works