08/26/2025
Recent advances in diffusion models have enabled more diverse, high-quality image generation, opening new possibilities in game development, filmmaking, and advertising. However, these applications often require precise control over the generation process to meet specific artistic, narrative, or branding goals. This demands conditioning inputs such as text instructions, reference images, or visual attributes, which in turn require training data that accurately reflect image-condition associations. Existing approaches to creating such data, including manual annotation, data re-purposing, and prompt engineering, offer some utility but face notable limitations in scalability, robustness, and quality, ultimately constraining the capabilities of the resulting models.
In response, this talk presents our research on automated methods for creating training data that enable and improve instruction-guided and attribute-based image editing with diffusion models, explored from two directions: refining existing datasets and developing evaluation models to guide fine-tuning.
For instruction-guided image editing, we identify semantic misalignment between text instructions and before/after image pairs as a major limitation in current training datasets. We then propose a self-supervised method to detect and correct this misalignment, improving editing quality after fine-tuning on the corrected samples.
Additionally, we note that existing evaluation metrics often rely on models with limited semantic understanding. To address this, we fine-tune vision-language models as robust evaluators using high-quality synthetic data. These evaluators can also act as reward models to guide the training of editing models via reinforcement learning.
Extending this framework, we explore attribute-based editing with novel visual attributes. We introduce a web-crawling pipeline to curate samples for few-shot fine-tuning, enabling diffusion models to become attribute-aware. These models can then generate diverse samples to train an attribute scorer that directs attribute-based editing.
Finally, we apply our methods to tasks such as virtual try-on and reference- or stroke-guided editing by introducing new conditioning mechanisms within diffusion models. Together, these contributions enable scalable, high-quality training-data generation for diffusion-based conditional image editing, improving model performance, controllability, and generalization.
https://ucsb.zoom.us/j/81715448696