ParaDiffusion:

Paragraph-to-Image Generation with Information-Enriched Diffusion Model

¹Kuaishou Technology, ²Zhejiang University, ³Show Lab, National University of Singapore.
^*Equal contribution.
^‡Corresponding author.

Abstract

Text-to-image (T2I) models have recently developed rapidly, achieving astonishing performance in terms of fidelity and textual alignment. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for the paragraph-to-image generation task, termed ParaDiffusion, which transfers the extensive semantic comprehension capabilities of large language models to image generation. At its core, it uses a large language model (e.g., Llama V2) to encode long-form text, followed by fine-tuning with LoRA to align the text-image feature spaces in the generation task. To facilitate the training of long-text semantic alignment, we also curated a high-quality paragraph-image pair dataset, namely ParaImage. This dataset contains a small amount of high-quality, meticulously annotated data and a large-scale synthetic set whose long text descriptions are generated by a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to 15% and 45% human voting rate improvements for visual appeal and text faithfulness, respectively. The code and dataset will be released to foster community research on long-text alignment.

More Samples

A woman adorned with vibrant floral accessories. She has a striking makeup look with bold green eyeshadow and a soft pink lip color. Her hair is styled with a mix of flowers, including red, pink, and green blooms. She wears intricate jewelry, including a choker necklace with multicolored beads and a matching earring.

A close-up photo of a person. The subject is a male. He was wearing a wide-brimmed hat, a gray-white beard on his face, a brown coat. His facial expression looked pensive and serious, with the clear blue sky in the background.

A young man wearing a black leather jacket and tie stood behind an old door, his gaze firmly fixed on the camera. The door had patterns of leaves and flowers on it, revealing a yellow background. His hair was casually curled and he appeared to be deep in thought or contemplating something.

A close-up photo of a person. The subject is a woman. She wore a blue coat with a gray dress underneath. She has blue eyes and blond hair, and wears a pair of earrings. Behind are blurred city buildings and streets.

A close-up picture of people and scenery. The subject is a middle-aged man. A man in gray clothing is standing on a rock by the sea. He is wearing a black hat. The man has his hands inserted into the pockets of the gray clothing. The background is the vast ocean and sky, with a few white clouds in the sky.

A woman wearing a blue dress and a black hat with red flowers on her head. The background is the famous Eiffel Tower in Paris, France. There are many people around the Eiffel Tower, some walking or standing, and there are also cars parked below. In addition to the main lady, there are other pedestrians scattered throughout the scene.


Algorithm: ParaDiffusion

The training pipeline of ParaDiffusion consists of three main stages:

1) Stage-1 pretrains on 0.3 billion samples to acquire general text-image knowledge.

2) Stage-2 employs millions of samples to simultaneously fine-tune the LLM and the diffusion model for paragraph-image alignment.

3) Stage-3 performs quality tuning with curated, high-quality annotated data (i.e., ParaImage-Small).
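The three stages can be summarized as a small schedule. The following is only an illustrative sketch, not the released training code: the `Stage` class, its field names, and the `summarize` helper are hypothetical, and `tunes_llm` is left as `None` wherever this page does not say whether the LLM is updated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """Hypothetical record describing one training stage (names are assumptions)."""
    name: str
    data: str
    purpose: str
    # True only where the page says the LLM is fine-tuned; None means "not stated".
    tunes_llm: Optional[bool] = None


# Values taken from the stage descriptions on this page.
STAGES: List[Stage] = [
    Stage("stage-1", "0.3 billion text-image samples",
          "general text-image pretraining"),
    Stage("stage-2", "millions of paragraph-image samples",
          "paragraph-image alignment", tunes_llm=True),
    Stage("stage-3", "ParaImage-Small (curated, annotated)",
          "quality tuning"),
]


def summarize(stages: List[Stage]) -> List[str]:
    """Return one human-readable line per stage, in training order."""
    return [f"{s.name}: {s.purpose} on {s.data}" for s in stages]


if __name__ == "__main__":
    for line in summarize(STAGES):
        print(line)
```

Keeping the schedule as plain data makes the ordering explicit: alignment on long text comes only after general pretraining, and the small curated set is reserved for the final pass.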

Dataset: ParaImage

The proposed ParaImage dataset consists of two main parts:

(a) High-quality images with generative captions (ParaImage-Big) are primarily employed for the paragraph-image alignment learning in Stage 2.

(b) Aesthetic images with manually written long-form descriptions (ParaImage-Small) are primarily used for quality tuning in Stage 3.

ParaImage-Small can be downloaded from Google Drive (stay tuned).

New Eval Prompts: ParaPrompts-400

Existing test prompts focus on short text-to-image generation and neglect the evaluation of paragraph-to-image generation. We therefore introduce a new evaluation set of prompts called ParaPrompts, which includes 400 long-text descriptions.

Previous prompt sets mostly test text alignment for prompts of 0-25 words, while our prompts extend to long-text alignment of 100 words or more.
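To make the length ranges concrete, the sketch below buckets a prompt by word count. It is a minimal illustration rather than the evaluation code: the function names, the whitespace-based word count, and the intermediate "medium" bucket are assumptions. Only the 25-word and 100-word thresholds come from the text above.

```python
def word_count(prompt: str) -> int:
    """Count words by splitting on whitespace (an assumed, simple tokenization)."""
    return len(prompt.split())


def length_bucket(prompt: str, short_max: int = 25, long_min: int = 100) -> str:
    """Classify a prompt as 'short', 'medium', or 'paragraph' by word count.

    short_max and long_min mirror the 0-25 and 100+ word ranges mentioned
    above; the 'medium' label for everything in between is hypothetical.
    """
    n = word_count(prompt)
    if n <= short_max:
        return "short"
    if n >= long_min:
        return "paragraph"
    return "medium"


if __name__ == "__main__":
    caption = "A close-up photo of a person. The subject is a woman."
    print(word_count(caption), length_bucket(caption))
```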
