ECCV 2026 Tutorial · September 8 (PM)

Post-Training Diffusion Models: Enhancing Capabilities, Control, and Alignment

Sayak Paul¹ Linoy Tsaban¹ Hila Chefer²

¹Hugging Face ²Black Forest Labs

Half-day tutorial at the European Conference on Computer Vision (ECCV) 2026 — second half of the day. Exact event location and time will be updated here.

Contact · Slides & Recording (coming soon)

Abstract

Pre-trained generative models are built using massive, typically unlabeled corpora, enabling them to capture broad, generic knowledge across diverse domains. However, at inference time, we often aim to adjust and customize these models: to exert control, enhance specific capabilities, and align their behavior with user intent and preferences. Post-training techniques have therefore emerged as both a practical necessity and an accessible means of adapting these powerful, yet static, models. This tutorial surveys the state of the art in post-training methods for diffusion models, analyzing their strengths, limitations, and areas of application. We conclude with a critical discussion of the boundaries of post-training, asking whether fundamental semantic malfunctions can truly be resolved without revisiting the pre-training process.

Tutorial Outline

We start by motivating the gap between pre-training and post-training, drawing on seminal work such as DDPM [1] and Latent Diffusion Models [2]. We then cover three broad themes — Alignment, Enhancement, and Control — each complemented by implementation references and noteworthy literature for further study.

Part 1

Pre-Training vs. Post-Training

A brief discussion on common pre-training methodologies of text-to-image diffusion models, and the practical motivation for post-training large pre-trained models. Sets up the three themes that organize the rest of the tutorial.

Part 2

Alignment

What alignment means within the diffusion paradigm, and how it is implemented in contemporary base models. Three core dimensions:

Definition: alignment with preferences and human values.
Safety, fairness, copyright: interpretability and concept inspection [3,4]; diffusion forgery and concept deletion [5,6].
RLHF & preference alignment: RL for diffusion [7], DPO [8], MaPO [9], online RL for flow matching [10].
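
As a rough illustration of the preference-alignment family listed above, the sketch below computes a DPO-style loss for a single preference pair, in the spirit of the Diffusion-DPO objective [8]. It assumes the four denoising errors (squared noise-prediction errors of the trained model and of a frozen reference model, on the preferred and the rejected image) have already been computed. The function names, argument names, and the folding of any timestep weighting into `beta` are illustrative assumptions, not the tutorial's implementation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_style_diffusion_loss(
    err_win_model: float,
    err_win_ref: float,
    err_lose_model: float,
    err_lose_ref: float,
    beta: float = 1.0,
) -> float:
    """DPO-style loss for one (preferred, rejected) image pair.

    Each `err_*` is a squared noise-prediction error at a sampled timestep
    (hypothetical inputs; computing them requires the actual denoisers).
    """
    # A negative gap means the model denoises this image better than the reference.
    win_gap = err_win_model - err_win_ref
    lose_gap = err_lose_model - err_lose_ref
    # Reward improving on the preferred image more than on the rejected one.
    logit = -beta * (win_gap - lose_gap)
    # -log(sigmoid(logit)) is the same as softplus(-logit).
    return softplus(-logit)
```

When the model matches the reference on both images, the loss equals log 2; it falls as the model's error on the preferred image drops relative to the reference, and rises if the model instead improves more on the rejected image.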

Part 3

Enhancement

Improving capabilities beyond the original training scope, plus distillation to reduce denoising steps. Covers both inference-only and training-based techniques:

Inference time: StyleAlign [11], Attend-and-Excite [12], PAG [13], CFG-Zero* [14], HeadHunter [15], inference-time scaling [16,17,18].
Training: fine-tuning for lighting, realism, style, faces [19,20]; distillation for step and CFG reduction [21,22].
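
For context on the guidance-related entries above: classifier-free guidance (CFG) combines an unconditional and a text-conditional noise prediction at every denoising step, which is why it costs two model evaluations per step and why distilling it away is attractive. The sketch below shows only that standard combination rule on plain Python lists. The function name and list-based inputs are illustrative assumptions; real pipelines operate on tensors.

```python
from typing import List


def cfg_combine(
    eps_uncond: List[float],
    eps_cond: List[float],
    guidance_scale: float,
) -> List[float]:
    """Extrapolate from the unconditional toward the conditional prediction."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy values standing in for two forward passes of a denoiser.
guided = cfg_combine([0.0, 2.0], [1.0, 4.0], guidance_scale=3.0)
```

A scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional one, and larger scales push further in the conditional direction.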

Part 4

Control

Structural control beyond natural language, plus emerging in-context-learning approaches for image generation:

Inference time: OmniGen [23] and related architectures.
Training: OminiControl [24], Ctrl-Adapter [25], Flux Control [26], ControlNet [27].

Format & Schedule

A half-day event on the afternoon of September 8, 2026. Indicative breakdown:

Part	Topic	Duration
Part 1	Pre-training vs. post-training — motivation and overview	45 min
Part 2	Alignment — definitions, safety, fairness, RLHF	45 min
Break	Coffee & networking	15 min
Part 3	Enhancement — inference-time and training-based techniques	45 min
Part 4	Control — structural control and in-context learning	45 min
Q&A	Open discussion on the boundaries of post-training	15 min

Speakers

Sayak Paul

Hugging Face

Research Engineer at Hugging Face working on image and video generation, with a focus on controllability and inference-time scaling for diffusion models. His research includes ReflectionFlow [28] (reflection in diffusion models) and MaPO [9] (reference mismatch in diffusion alignment). He co-maintains the Diffusers library.

Linoy Tsaban

Hugging Face

Machine Learning Engineer at Hugging Face working at the intersection of AI and the arts, with a focus on image and video diffusion models. She contributes fine-tuning and inference-time control pipelines to Diffusers, and co-authored LEDITS and LEDITS++ (spotlight at NeurIPS'23; CVPR'24). She has organized a generative AI meetup attended by over 500 people.

Hila Chefer

Black Forest Labs

Researcher in the Fundamental Research team at Black Forest Labs and incoming Assistant Professor at Tel Aviv University (starting October 2026). Her work focuses on understanding, interpreting, and controlling deep foundational models, including transformer explainability, attention-based semantic guidance for diffusion models (Attend-and-Excite [12]), Google's Lumiere video model, and Meta's VideoJAM training framework. During her PhD at Tel Aviv University, she was a visiting researcher at Google Research, Google DeepMind, and Meta AI.

References

[1] Ho, J., Jain, A., Abbeel, P. Denoising diffusion probabilistic models. NeurIPS 2020.
[2] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B. High-resolution image synthesis with latent diffusion models. CVPR 2022.
[3] Chefer, H., et al. Concept representation and inspection in diffusion models. 2024.
[4] Carlini, N., et al. Extracting training data from diffusion models. 2023.
[5] Somepalli, G., et al. Diffusion art or digital forgery? Investigating data replication in diffusion models. CVPR 2023.
[6] Gandikota, R., et al. Erasing concepts from diffusion models. ICCV 2023.
[7] Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S. Training diffusion models with reinforcement learning. ICLR 2024.
[8] Wallace, B., et al. Diffusion model alignment using direct preference optimization. CVPR 2024.
[9] Hong, J., Paul, S., Lee, N., Rasul, K., Thorne, J., Jeong, J. Margin-aware preference optimization for aligning diffusion models without reference. 2024.
[10] Liu, J., et al. Flow-GRPO: Training flow matching models via online RL. 2025.
[11] Hertz, A., Voynov, A., Fruchter, S., Cohen-Or, D. Style aligned image generation via shared attention. CVPR 2024.
[12] Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM SIGGRAPH 2023.
[13] Ahn, D., et al. Self-rectifying diffusion sampling with perturbed-attention guidance. ECCV 2024.
[14] Fan, W., Zheng, A.Y., Yeh, R.A., Liu, Z. CFG-Zero*: Improved classifier-free guidance for flow matching models. 2025.
[15] Ahn, D., et al. HeadHunter. 2025.
[16] Ma, N., et al. Inference-time scaling for diffusion models. 2025.
[17] Singhal, R., et al. Inference-time scaling for diffusion models. 2025.
[18] Zhuo, L., et al. Inference-time scaling for diffusion models. 2025.
[19] Zhang, L., et al. Fine-tuning diffusion models for improved generation quality. 2025.
[20] Li, Y., et al. Fine-tuning diffusion models for improved generation quality. 2025.
[21] Sauer, A., et al. Adversarial diffusion distillation. CVPR 2024.
[22] Chen, X., et al. Distillation for diffusion models. 2025.
[23] Xiao, S., et al. OmniGen: Unified image generation. 2024.
[24] Tan, Z., et al. OminiControl: Minimal and universal control for diffusion transformer. 2024.
[25] Lin, H., et al. Ctrl-Adapter: An efficient and versatile framework for adapting diverse controls to any diffusion model. 2024.
[26] Black Forest Labs. Flux Control. 2024.
[27] Zhang, L., Rao, A., Agrawala, M. Adding conditional control to text-to-image diffusion models. ICCV 2023.
[28] Zhuo, L., Zhao, L., Paul, S., et al. From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. ICCV 2025.