Weijia Wu
I am currently a research fellow at Show Lab, National University of Singapore, working with Prof. Mike Z. Shou. I received my PhD from Zhejiang University. Outside of research, I am a popping dancer and have been practicing popping for five years; in my spare time I also enjoy playing basketball and swimming.
Email / Google Scholar / GitHub / Twitter
Research Interests
My current research focuses on AI for camera-based video production, including video generation and editing, controllable video generation, and long video generation.
Notes:
Talks, collaborations, and job opportunities are all welcome. Feel free to contact me by email (weijiawu96@gmail.com).
Video Text Detection/Spotting: TransDETR (IJCV 2024), DSText (ICDAR 2023 & PR 2024), BOVText (NeurIPS 2021, Dataset Track)
Video Retrieval: TextVR (PR 2023)
Synthetic Data for Perception Tasks: DiffuMask (ICCV 2023), DatasetDM (NeurIPS 2023)
Image Generation: ParaDiffusion (arXiv, Nov. 2023)
Video Generation: DragAnything (ECCV 2024), Awesome-Video-Diffusion (3.4k stars)
Recent Updates
[Sep. 2024]: One paper (ZipCache) was accepted to NeurIPS 2024!
[Jul. 2024]: One paper (TextVR) was accepted to Pattern Recognition 2024!
[Jul. 2024]: Three papers (DragAnything, MotionDirector (Oral), ControlCap) were accepted to ECCV 2024!
[Mar. 2024]: One paper (TransDETR) was accepted to IJCV!
[Feb. 2024]: One paper (DiverGen) was accepted to CVPR 2024!
[Jan. 2024]: Got my Ph.D. degree from Zhejiang University!
[Jan. 2024]: One paper (EfficientDM) was accepted to ICLR 2024 (Spotlight)!
[Nov. 2023]: One paper (DSText V2) was accepted to Pattern Recognition 2024!
[Nov. 2023]: One paper (CisDQ) was accepted to IEEE TCSVT 2023!
[Sep. 2023]: Three papers (DatasetDM, Mix-of-Show, PTQD) were accepted to NeurIPS 2023!
[Jul. 2023]: Three papers (DiffuMask, GenPromp, BiViT) were accepted to ICCV 2023!
[Dec. 2022]: We are organizing the LOng-form VidEo Understanding and Generation Workshop & International Challenge @ CVPR'23!
[Dec. 2022]: We are organizing the ICDAR 2023 Video Text Reading Competition for Dense and Small Text!
[Jun. 2022]: One paper was accepted to ICIP 2022!
[Mar. 2022]: Served as a reviewer for ICML 2022.
[Jul. 2021]: One paper was accepted to NeurIPS 2021!
[Jun. 2021]: Served as a reviewer for NeurIPS 2021.
Experience
[January 2021 - January 2022]: Research Intern at MMU, KuaiShou, led by Debing Zhang
[January 2022 - August 2023]: Research Intern at MMU, KuaiShou, led by Jiahong Li
[August 2022 - August 2023]: Visiting PhD Student at Show Lab, NUS, advised by Asst. Prof. Mike Shou
[December 2023 - Present]: Research Fellow at Show Lab, NUS, working with Asst. Prof. Mike Shou
DragAnything: Motion Control for Anything using Entity Representation
Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, Di Zhang.
The 18th European Conference on Computer Vision (ECCV 2024).
Abstract / arXiv / BibTeX / GitHub / Project Page / Twitter (X)
We introduce DragAnything, which uses an entity representation to achieve motion control for any object in controllable video generation. Compared to existing motion control methods, DragAnything offers several advantages. First, trajectory-based control is more user-friendly for interaction, since acquiring other guidance signals (e.g., masks, depth maps) is labor-intensive; users only need to draw a line (trajectory). Second, our entity representation serves as an open-domain embedding capable of representing any object, enabling motion control for diverse entities, including the background. Lastly, our entity representation allows simultaneous and distinct motion control for multiple objects. Extensive experiments demonstrate that DragAnything achieves state-of-the-art performance in FVD, FID, and user studies, particularly for object motion control, where it surpasses previous methods (e.g., DragNUWA) by 26% in human voting.
@article{wu2024draganything,
title={DragAnything: Motion Control for Anything using Entity Representation},
author={Wu, Weijia and Li, Zhuang and Gu, Yuchao and Zhao, Rui and He, Yefei and Zhang, David Junhao and Shou, Mike Zheng and Li, Yan and Gao, Tingting and Zhang, Di},
journal={The 18th European Conference on Computer Vision (ECCV 2024)},
year={2024}
}
EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models
Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang.
The Twelfth International Conference on Learning Representations (ICLR 2024, Spotlight).
Abstract / arXiv / BibTeX
Diffusion models have demonstrated remarkable capabilities in image synthesis and related generative tasks. Nevertheless, their practicality for low-latency real-world applications is constrained by substantial computational costs and latency issues. Quantization is a dominant way to compress and accelerate diffusion models, where post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches, each bearing its own properties. While PTQ exhibits efficiency in terms of both time and data usage, it may lead to diminished performance in low bit-width settings. On the other hand, QAT can help alleviate performance degradation but comes with substantial demands on computational and data resources. To capitalize on the advantages while avoiding their respective drawbacks, we introduce a data-free, quantization-aware and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency. Specifically, we propose a quantization-aware variant of the low-rank adapter (QALoRA) that can be merged with model weights and jointly quantized to low bit-width. The fine-tuning process distills the denoising capabilities of the full-precision model into its quantized counterpart, eliminating the requirement for training data. To further enhance performance, we introduce scale-aware optimization to address ineffective learning of QALoRA due to variations in weight quantization scales across different layers. We also employ temporal learned step-size quantization to handle notable variations in activation distributions across denoising steps. Extensive experimental results demonstrate that our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency. Specifically, there is only a marginal 0.05 sFID increase when quantizing both weights and activations of LDM-4 to 4-bit on ImageNet 256×256. Compared to QAT-based methods, our EfficientDM also boasts a 16.2× faster quantization speed with comparable generation quality, rendering it a compelling choice for practical applications.
@article{he2023efficientdm,
title={EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models},
author={He, Yefei and Liu, Jing and Wu, Weijia and Zhou, Hong and Zhuang, Bohan},
journal={The Twelfth International Conference on Learning Representations (ICLR 2024)},
year={2024}
}
PTQD: Accurate Post-Training Quantization for Diffusion Models
Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang.
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023).
Abstract / arXiv / BibTeX
Diffusion models have recently dominated image synthesis and other related generative tasks. However, the iterative denoising process is expensive in computations at inference time, making diffusion models less practical for low-latency and scalable real-world applications. Post-training quantization of diffusion models can significantly reduce the model size and accelerate the sampling process without requiring any re-training. Nonetheless, applying existing post-training quantization methods directly to low-bit diffusion models can significantly impair the quality of generated samples. Specifically, for each denoising step, quantization noise leads to deviations in the estimated mean and mismatches with the predetermined variance schedule. Moreover, as the sampling process proceeds, the quantization noise may accumulate, resulting in a low signal-to-noise ratio (SNR) in late denoising steps. To address these challenges, we propose a unified formulation for the quantization noise and diffusion perturbed noise in the quantized denoising process. We first disentangle the quantization noise into its correlated and residual uncorrelated parts regarding its full-precision counterpart. The correlated part can be easily corrected by estimating the correlation coefficient. For the uncorrelated part, we calibrate the denoising variance schedule to absorb the excess variance resulting from quantization. Moreover, we propose a mixed-precision scheme to choose the optimal bitwidth for each denoising step, which prefers low bits to accelerate the early denoising steps while high bits maintain the high SNR for the late steps.
@article{he2023ptqd,
title={PTQD: Accurate Post-Training Quantization for Diffusion Models},
author={He, Yefei and Liu, Luping and Liu, Jing and Wu, Weijia and Zhou, Hong and Zhuang, Bohan},
journal={Proc. Advances In Neural Information Processing Systems (NeurIPS 2023)},
year={2023}
}
DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models
Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen.
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023).
Abstract / arXiv / BibTeX / GitHub / Project Page / Chinese WeChat Coverage
Current deep networks are very data-hungry and benefit from training on large-scale datasets, which are often time-consuming to collect and annotate. By contrast, synthetic data can be generated infinitely using generative models such as DALL-E and diffusion models, with minimal effort and cost. In this paper, we present DatasetDM, a generic dataset generation model that can produce diverse synthetic images and the corresponding high-quality perception annotations (e.g., segmentation masks and depth). Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation. We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module. Training the decoder requires less than 1% (around 100) manually labeled images, enabling the generation of an infinitely large annotated dataset. These synthetic data can then be used to train various perception models for downstream tasks. To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation. Notably, it achieves 1) state-of-the-art results on semantic segmentation and instance segmentation; 2) significantly better robustness in domain generalization than using real data alone, and state-of-the-art results in the zero-shot segmentation setting; and 3) flexibility for efficient application and novel task composition (e.g., image editing).
@article{wu2023datasetdm,
title={DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models},
author={Wu, Weijia and Zhao, Yuzhong and Chen, Hao and Gu, Yuchao and Zhao, Rui and He, Yefei and Zhou, Hong and Shou, Mike Zheng and Shen, Chunhua},
journal={Proc. Advances In Neural Information Processing Systems (NeurIPS 2023)},
year={2023}
}
BiViT: Extremely Compressed Binary Vision Transformers
Yefei He, Zhenyu Lou, Luoming Zhang, Weijia Wu, Bohan Zhuang, Hong Zhou.
International Conference on Computer Vision (ICCV 2023).
Abstract / arXiv / BibTeX
Model binarization can significantly compress model size, reduce energy consumption, and accelerate inference through efficient bit-wise operations. Although binarizing convolutional neural networks has been extensively studied, there is little work exploring binarization of vision Transformers, which underpin most recent breakthroughs in visual recognition. To this end, we propose to solve two fundamental challenges to push the horizon of Binary Vision Transformers (BiViT). First, the traditional binary method does not take the long-tailed distribution of softmax attention into consideration, bringing large binarization errors in the attention module. To solve this, we propose Softmax-aware Binarization, which dynamically adapts to the data distribution and reduces the error caused by binarization. Second, to better exploit the information of the pretrained model and restore accuracy, we propose a Cross-layer Binarization scheme and introduce learnable channel-wise scaling factors for weight binarization. The former decouples the binarization of self-attention and MLP to avoid mutual interference, while the latter enhances the representation capacity of binarized models. Overall, our method performs favorably against the state of the art by 19.8% on the TinyImageNet dataset. On ImageNet, BiViT achieves a competitive 70.8% Top-1 accuracy with the Swin-T model, outperforming existing SOTA methods by a clear margin.
@InProceedings{he2022bivit,
title={Bivit: Extremely compressed binary vision transformer},
author={He, Yefei and Lou, Zhenyu and Zhang, Luoming and Wu, Weijia and Zhuang, Bohan and Zhou, Hong},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2023}
}
Generative Prompt Model for Weakly Supervised Object Localization
Yuzhong Zhao, Qixiang Ye, Weijia Wu, Chunhua Shen, Fang Wan.
International Conference on Computer Vision (ICCV 2023).
Abstract / arXiv / BibTeX / GitHub
Weakly supervised object localization (WSOL) remains challenging when learning object localization models from image category labels. Conventional methods that discriminatively train activation models ignore representative yet less discriminative object parts. In this study, we propose a generative prompt model (GenPromp), defining the first generative pipeline to localize less discriminative object parts by formulating WSOL as a conditional image denoising procedure. During training, GenPromp converts image category labels to learnable prompt embeddings, which are fed to a generative model to conditionally recover the input image with noise and learn representative embeddings. During inference, GenPromp combines the representative embeddings with discriminative embeddings (queried from an off-the-shelf vision-language model) for both representative and discriminative capacity. The combined embeddings are finally used to generate multi-scale, high-quality attention maps, which facilitate localizing the full object extent. Experiments on CUB-200-2011 and ILSVRC show that GenPromp outperforms the best discriminative models by 5.2% and 5.6% (Top-1 Loc), respectively, setting a solid baseline for WSOL with generative models.
@InProceedings{zhao2023GPM,
author = {Zhao, Yuzhong and Ye, Qixiang and Wu, Weijia and Shen, Chunhua and Wan, Fang},
title = {Generative Prompt Model for Weakly Supervised Object Localization},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2023}
}
DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models
Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen.
International Conference on Computer Vision (ICCV 2023).
Abstract / arXiv / BibTeX / GitHub / Project Page
Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be generated freely using a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by the off-the-shelf Stable Diffusion model, which uses only text-image pairs during training. Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image, which makes it natural and seamless to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which are combined with practical techniques to create novel high-resolution and class-discriminative pixel-wise masks. This approach substantially reduces data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on the synthetic data of DiffuMask achieve competitive performance compared with their counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask presents promising performance, close to the state-of-the-art result on real data (within a 3% mIoU gap). Moreover, in the open-vocabulary (zero-shot) segmentation setting, DiffuMask achieves a new SOTA result on the unseen classes of VOC 2012.
@InProceedings{wu2023DiffuMask,
author = {Wu, Weijia and Zhao, Yuzhong and Shou, Mike Zheng and Zhou, Hong and Shen, Chunhua},
title = {DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2023}
}
End-to-End Video Text Spotting with Transformer
Weijia Wu, Yuanqiang Cai, Chunhua Shen, Debing Zhang, Ying Fu, Ping Luo, Hong Zhou.
International Journal of Computer Vision (IJCV), 2024.
Abstract / arXiv / IJCV / BibTeX / GitHub / YouTube Demo / Zhihu
Video text spotting (VTS) is the task of simultaneously detecting, tracking, and recognizing text in video. Recent methods typically follow the tracking-by-match paradigm and develop sophisticated pipelines to tackle this task. In this paper, rooted in Transformer sequence modeling, we propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR). TransDETR has two main advantages: 1) Different from the explicit match paradigm between adjacent frames, TransDETR tracks and recognizes each text implicitly via a dedicated query, termed 'text query', over a long-range temporal sequence (more than 7 frames). 2) TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (text detection, tracking, and recognition). Extensive experiments on four video text datasets (ICDAR2013 Video, ICDAR2015 Video, Minetto, and YouTube Video Text) demonstrate that TransDETR achieves state-of-the-art performance, with up to around 7.0% improvement on detection, tracking, and spotting tasks.
@article{wu2022end,
author = {Wu, Weijia and Cai, Yuanqiang and Shen, Chunhua and Zhang, Debing and Fu, Ying and Luo, Ping and Zhou, Hong},
title = {End-to-End Video Text Spotting with Transformer},
journal = {International Journal of Computer Vision},
year = {2024}
}
Real-time End-to-End Video Text Spotter with Contrastive Representation Learning
Weijia Wu, Zhuang Li, Jiahong Li, Chunhua Shen, Size Li, Zhongyuan Wang, Ping Luo, Hong Zhou.
arXiv preprint, 2022.
Abstract / arXiv / BibTeX / Code / YouTube Demo
Video text spotting (VTS) is the task of simultaneously detecting, tracking, and recognizing text instances in video. Existing video text spotting methods typically develop sophisticated pipelines with multiple models, which is not friendly to real-time applications. Here we propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText). Our contributions are four-fold: 1) For the first time, we simultaneously address the three tasks (text detection, tracking, and recognition) in a real-time, end-to-end trainable framework. 2) Like humans, CoText tracks and recognizes texts by comprehending them and relating them to each other through visual and semantic representations. 3) With contrastive learning, CoText models long-range dependencies and learns temporal information across multiple frames. 4) A simple, lightweight architecture is designed for effective and accurate performance, including GPU-parallel detection post-processing, a CTC-based recognition head with Masked RoI, and a track head with contrastive learning. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting IDF1 of 72.0% at 35.2 FPS on ICDAR2015 (video), a 10.5% and 26.2 FPS improvement over the previous best method.
@InProceedings{wu2022cotext,
author = {Wu, Weijia and Li, Zhuang and Li, Jiahong and Shen, Chunhua and Zhou, Hong and Li, Size and Luo, Ping},
title = {Real-time End-to-End Video Text Spotter with Contrastive Representation Learning},
booktitle = {arxiv},
year = {2022}
}
Polygon-free: Unconstrained Scene Text Detection with Box Annotations
Weijia Wu, Enze Xie, Ruimao Zhang, Wenhai Wang, Ping Luo, Hong Zhou.
IEEE International Conference on Image Processing (ICIP 2022).
Abstract / arXiv / BibTeX / Code
Although a polygon is a more accurate representation than an upright bounding box for text detection, polygon annotations are extremely expensive and challenging. Unlike existing works that employ fully supervised training with polygon annotations, this study proposes an unconstrained text detection system termed Polygon-free (PF), in which most existing polygon-based text detectors (e.g., PSENet, DB) are trained with only upright bounding box annotations. Our core idea is to transfer knowledge from synthetic data to real data to enhance the supervision information of upright bounding boxes. This is made possible with a simple segmentation network, namely the Skeleton Attention Segmentation Network (SASN), which includes three vital components (i.e., channel attention, spatial attention, and a skeleton attention map) and one soft cross-entropy loss. Experiments demonstrate that the proposed Polygon-free system can be combined with general detectors (e.g., EAST, PSENet, DB) to yield surprisingly high-quality pixel-level results with only upright bounding box annotations on a variety of datasets (e.g., ICDAR2019-Art, TotalText, ICDAR2015). For example, without using polygon annotations, PSENet achieves an 80.5% F-score on TotalText (vs. 80.9% for the fully supervised counterpart), 31.1% better than training directly with upright bounding box annotations, and saves 80%+ of labeling costs. We hope that PF can provide a new perspective for text detection to reduce labeling costs.
@InProceedings{wu2022polygon,
author = {Weijia Wu and Enze Xie and Ruimao Zhang and Wenhai Wang and Ping Luo and Hong Zhou},
title = {Polygon-free: Unconstrained Scene Text Detection with Box Annotations},
booktitle = {IEEE International Conference on Image Processing (ICIP)},
year = {2022}
}
ECLIP: Efficient Contrastive Language-Image Pretraining via Ensemble Confidence Learning and Masked Language Modeling
Jue Wang, Haofan Wang, Weijia Wu, Jincan Deng, Debing Zhang.
ICML 2022 Pre-training Workshop, 2022.
Abstract / arXiv / BibTeX
While large-scale pre-training has achieved great success in bridging the gap between vision and language, it still faces three challenges. First, the cost of pre-training is expensive. Second, there is no efficient way to handle the data noise that degrades model performance. Third, previous methods only leverage limited image-text paired data, while ignoring richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose Efficient Contrastive Language-Image Pretraining (ECLIP) via Ensemble Confidence Learning and Masked Language Modeling. Specifically, we adaptively filter out noisy samples during training by means of an Ensemble Confidence Learning strategy, and add a Masked Language Modeling objective to utilize extra non-paired text data. ECLIP achieves state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 of the training resources compared with CLIP and WenLan, while showing excellent generalization to single-modal tasks including text retrieval and text classification.
@InProceedings{wang2021efficientclip,
author = {Wang, Jue and Wang, Haofan and Wu, Weijia and Deng, Jincan and Zhang, Debing},
title = {Efficientclip: Efficient cross-modal pre-training by ensemble confident learning and language modeling},
booktitle = {ICML 2022 Pre-training Workshop},
year = {2022}
}
A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer
Weijia Wu, Yuanqiang Cai, Debing Zhang, Jiahong Li, Hong Zhou.
NeurIPS 2021 Track on Datasets and Benchmarks (NeurIPS), 2021.
Abstract / arXiv / Homepage / BibTeX / GitHub / Demo / Chinese WeChat Coverage / Zhihu Analysis (by others)
Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText). BOVText has four features. First, we provide 2,000+ videos with more than 1,750,000 frames, 25 times larger than the existing largest dataset with incidental text in videos. Second, our dataset covers 30+ open categories with a wide selection of scenarios, e.g., Life Vlog, Driving, Movie, etc. Third, abundant text type annotations (i.e., title, caption, or scene text) are provided for the different representational meanings in video. Fourth, BOVText provides bilingual text annotations to promote communication across multiple cultures. Besides, we propose an end-to-end video text spotting framework with Transformer, termed TransVTSpotter, which solves multi-oriented text spotting in video with a simple but efficient attention-based query-key mechanism. It applies object features from the previous frame as a tracking query for the current frame and introduces rotation angle prediction to fit multi-oriented text instances. On ICDAR2015 (video), TransVTSpotter achieves state-of-the-art performance with 44.1% MOTA at 9 FPS.
@InProceedings{wu2021opentext,
author = {Wu, Weijia and Zhang, Debing and Cai, Yuanqiang and Wang, Sibo and Li, Jiahong and Li, Zhuang and Tang, Yejun and Zhou, Hong},
title = {A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer},
booktitle = {35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
year = {2021}
}
Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild
Weijia Wu, Ning Lu, Enze Xie, Hong Zhou.
Proceedings of the Asian Conference on Computer Vision (ACCV), 2020.
Abstract / arXiv / BibTeX
Deep learning-based scene text detection can achieve preferable performance when powered with sufficient labeled training data. However, manual labeling is time-consuming and laborious, and at the extreme, the corresponding annotated data are unavailable. Exploiting synthetic data is a very promising solution, except for domain distribution mismatches between synthetic datasets and real datasets. To address this severe domain distribution mismatch, we propose a synthetic-to-real domain adaptation method for scene text detection, which transfers knowledge from synthetic data (source domain) to real data (target domain). In this paper, a text self-training (TST) method and adversarial text instance alignment (ATA) for domain adaptive scene text detection are introduced. ATA helps the network learn domain-invariant features by training a domain classifier in an adversarial manner. TST diminishes the adverse effects of false positives (FPs) and false negatives (FNs) from inaccurate pseudo-labels. Both components have positive effects on improving the performance of scene text detectors when adapting from synthetic to real scenes. We evaluate the proposed method by transferring from SynthText and VISD to ICDAR2015 and ICDAR2013. The results demonstrate the effectiveness of the proposed method with up to 10% improvement, which has important exploratory significance for domain adaptive scene text detection.
@InProceedings{wu2020synthetic,
author = {Wu, Weijia and Lu, Ning and Xie, Enze and Wang, Yuxing and Yu, Wenwen and Yang, Cheng and Zhou, Hong},
title = {Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
year = {2020}
}
Academic Service
Conference Review
International Conference on Machine Learning (ICML), 2022, 2023, 2024.
Neural Information Processing Systems (NeurIPS), 2021, 2022, 2023, 2024.
NeurIPS Track on Datasets and Benchmarks, 2021, 2022, 2023, 2024.
Association for Computational Linguistics (ACL), 2024.
International Conference on Learning Representations (ICLR), 2023, 2024, 2025.
The Association for the Advancement of Artificial Intelligence (AAAI), 2025.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
European Conference on Computer Vision (ECCV), 2024.
CVPR 2024 SyntaGen Workshop.
Journal Review
IEEE Transactions on Neural Networks and Learning Systems (TNNLS).
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT).
International Journal of Computer Vision (IJCV).
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM).
Awards
[10/2020] National Scholarship of China.
Talks
[10/2021] MMU, Kuaishou. Video Text Spotting.