DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models


Weijia Wu1,3
Yuzhong Zhao2
Mike Zheng Shou3 *
Hong Zhou1
Chunhua Shen1

1Zhejiang University
2University of Chinese Academy of Sciences
3National University of Singapore

Code [GitHub]

Paper [arXiv]

Cite [BibTeX]




DiffuMask synthesizes realistic images and mask annotations by exploiting the attention maps of the diffusion model. Without human effort or localization annotations (i.e., boxes and masks), DiffuMask is capable of producing high-quality semantic masks.


Abstract

Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be produced freely using a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks for synthetic images generated by the pre-trained Stable Diffusion, which uses only text-image pairs during training. Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image, making it natural and seamless to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which is combined with practical techniques to create novel high-resolution, class-discriminative pixel-wise masks. The method significantly reduces data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on DiffuMask's synthetic data achieve performance competitive with their counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask comes close to the state-of-the-art result on real data (within a 3% mIoU gap). Moreover, in the open-vocabulary segmentation (zero-shot) setting, DiffuMask achieves a new SOTA result on the unseen classes of VOC 2012.


Can the attention map be used as mask annotation?

A ‘good’ mask annotation must satisfy two conditions: 1) it is class-discriminative, and 2) it is a high-resolution, precise mask. The averaged cross-attention map satisfies both, being class-discriminative and fine-grained, which shows that it can be used for semantic segmentation.
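To make this concrete, the sketch below shows how a text-to-image cross-attention map could be averaged and binarized into a mask for one target word. It is a minimal illustration with dummy tensors, not the authors' implementation; the shapes, the fixed threshold, and the function name are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): turning a text-to-image
# cross-attention map into a binary mask for one target word. All shapes,
# the fixed threshold, and the function name are illustrative assumptions;
# DiffuMask instead derives an adaptive, class-wise threshold via AffinityNet.
import torch
import torch.nn.functional as F

def cross_attention_mask(query, key, word_index, out_size=512, threshold=0.5):
    """query: image features (heads, h*w, d); key: text embeddings (heads, tokens, d)."""
    d = query.shape[-1]
    # Standard scaled dot-product attention over the text tokens.
    attn = torch.softmax(query @ key.transpose(-1, -2) / d**0.5, dim=-1)  # (heads, h*w, tokens)
    # Keep the attention assigned to the target word; average over heads
    # (the paper also aggregates maps across layers and denoising timesteps).
    word_attn = attn[..., word_index].mean(dim=0)  # (h*w,)
    hw = int(word_attn.numel() ** 0.5)
    word_attn = word_attn.reshape(1, 1, hw, hw)
    # Upsample the low-resolution attention map to image resolution.
    word_attn = F.interpolate(word_attn, size=(out_size, out_size),
                              mode="bilinear", align_corners=False)
    # Min-max normalize to [0, 1], then binarize with a fixed threshold.
    word_attn = (word_attn - word_attn.min()) / (word_attn.max() - word_attn.min() + 1e-8)
    return (word_attn > threshold).squeeze()

# Dummy tensors: 8 heads, a 16x16 attention grid, 77 text tokens, dim 64.
q, k = torch.randn(8, 16 * 16, 64), torch.randn(8, 77, 64)
print(cross_attention_mask(q, k, word_index=5).shape)  # torch.Size([512, 512])
```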



Pipeline

Pipeline for DiffuMask with a given prompt: ‘Photo of a [sub-class] car in the street’. DiffuMask consists of three main steps: 1) Prompt engineering is used to enhance the diversity and realism of the prompts (see the sketch below). 2) Images and masks are generated, and the masks are refined with an adaptive threshold obtained from AffinityNet. 3) Noise learning further improves data quality by filtering out noisy labels.
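A hedged sketch of step 1 is shown below. The sub-class list and scene phrases are illustrative placeholders, not the paper's actual vocabulary; only the template shape follows the prompt given above.

```python
# Hedged sketch of step 1 (prompt engineering). The sub-class list and scene
# phrases below are illustrative placeholders, not the paper's actual vocabulary.
import random

SUBCLASSES = {"car": ["sports", "vintage", "police", "family"]}
SCENES = ["in the street", "on a highway", "in a parking lot"]

def build_prompt(cls: str) -> str:
    """Instantiate the template 'Photo of a [sub-class] <class> <scene>'."""
    return f"Photo of a {random.choice(SUBCLASSES[cls])} {cls} {random.choice(SCENES)}"

print([build_prompt("car") for _ in range(3)])
```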



Protocol-I: Semantic Segmentation

Quantitative results for Protocol-I evaluation on semantic segmentation.

Qualitative Results

todo



Protocol-II: Open-vocabulary Segmentation

Comparison with previous ZS3 methods on PASCAL VOC.

“Seen”, “Unseen”, and “Harmonic” denote the mIoU of seen categories, unseen categories, and their harmonic mean, respectively. The ZS3 methods are trained on the PASCAL VOC training set.
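For clarity, the “Harmonic” column is simply the harmonic mean of the two mIoU values; the numbers in the snippet below are illustrative, not taken from the table.

```python
# Harmonic mean of seen and unseen mIoU, as reported in the "Harmonic" column.
def harmonic_mean(seen: float, unseen: float) -> float:
    return 2 * seen * unseen / (seen + unseen)

# Illustrative values only (not results from the paper):
print(round(harmonic_mean(80.0, 40.0), 1))  # 53.3
```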

Protocol-III: Domain Generalization

Performance for domain generalization across different datasets.

The table presents results for cross-dataset validation, which evaluates how well the data generalizes. Compared with real data, DiffuMask shows strong domain generalization, e.g., 69.5% mIoU with DiffuMask vs. 68.0% with ADE20K on the VOC 2012 val set.
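The cross-dataset protocol amounts to training on one dataset (here, DiffuMask's synthetic data) and scoring mIoU on another dataset's validation split. Below is a minimal mIoU sketch, with random arrays standing in for real predictions and labels.

```python
# Hedged sketch of the cross-dataset metric: mIoU of predicted vs. ground-truth
# label maps. Random arrays below stand in for a real model's outputs.
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both prediction and label
            ious.append(inter / union)
    return float(np.mean(ious))

# Dummy 21-class (VOC-style) predictions and labels for four 512x512 images.
rng = np.random.default_rng(0)
pred = rng.integers(0, 21, size=(4, 512, 512))
gt = rng.integers(0, 21, size=(4, 512, 512))
print(f"{100 * miou(pred, gt, num_classes=21):.1f}% mIoU")
```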

Ablation Study



Acknowledgements

Based on a template by Ziyi Li and Richard Zhang.