Active Preference Alignment
To be submitted to ICML 2026
Problem Overview
Current AI alignment paradigms adopt a corrective approach. Instead of embedding safety in intelligence's substrate, these methods treat alignment as an adversarial, post-hoc stage. They sculpt outputs only after capabilities solidify. I contend this disjoint method relies on static proxy objectives. Training a single reward model to approximate monolithic human preferences fails to capture values' dynamic, partially-observable nature. The optimization process consequently becomes a game of exploiting this static proxy; policies push against divergence constraints to maximize scores. The result is a fragile alignment subject to reward misspecification and over-optimization, with models exploiting proxy metrics instead of internalizing underlying intent.
Although these empirical methods drive rapid benchmark progress, treating alignment as purely corrective risks failures that grow nonlinearly with scale. Models already show emergent capabilities that break training-time assumptions. To manage this, we must move beyond post-hoc fine-tuning and build alignment directly into the generative process.
This conviction defines my research objective: building inherently aligned, natively steerable systems attuned to dynamic human preferences. My current research project stems from this principle. To overcome reward over-optimization and diversity loss in fine-tuning, I developed a theoretically grounded, training-free alignment method based on Sequential Monte Carlo sampling and Feynman-Kac Correctors. This framework rethinks alignment by integrating online preference signals directly into the latent generative process through stochastic dynamics. Unlike classifier guidance requiring specific discriminators, this method decouples the generative prior from the value-alignment likelihood. We guide generation with reward-tilted perturbations in real time, requiring only incremental, sparse, and partially-observable feedback from users with unknown preferences.
This work, prepared for submission to ICML 2026, addresses alignment's optimal transport problem: shifting the pre-trained distribution toward user preferences while minimizing distortion of the generative manifold. The method provides principled control over the transport map, enabling fine-grained steering without retraining. In image generation, with its high-dimensional latent spaces, non-convex reward landscapes, and probabilistic transformations, the approach converges robustly to high-reward regions while preserving distributional diversity. It maintains geometric coherence and semantic meaning, avoiding mode collapse and reward hacking. These results support a paradigm in which steerability is inherent.
Abstract
Diffusion models are highly effective at modeling complex data distributions, including images and text. However, in applications like personalized recommender systems, the objective often shifts to modeling specific regions of the distribution that maximize user preferences—initially unknown but gradually uncovered through interactive feedback. This can naturally be framed as a reinforcement learning problem, where the goal is to fine-tune a diffusion model to maximize a reward function based on preferences. However, the main challenge lies in learning a parameterized reward model, which typically requires large-scale preference data—something that is often not feasible in practice. In this work, we introduce a novel framework that bypasses the requirement for a pretrained reward model by directly optimizing the dynamics of the reverse diffusion process using real-time user feedback. Our framework enables feedback-efficient preference alignment, drawing inspiration from the Feynman-Kac-based Fokker-Planck framework. We demonstrate our framework's effectiveness through extensive experiments and ablation studies across diverse domains. Additionally, based on theoretical insights, we propose an enhanced fine-tuning strategy that requires less computational budget and accelerates the fine-tuning process, further boosting its suitability for real-world deployment.
Introduction
Diffusion models are powerful deep generative frameworks that synthesize data by reversing a diffusion process, enabling them to capture complex distributions such as natural image manifolds. Yet, in applications like personalized product recommendation, the goal shifts to steering generation toward items that match individual user preferences—preferences that gradually emerge from user interactions. Similar challenges occur in other domains. For example, diffusion models trained on large internet datasets are often used for image generation, but practical use cases demand outputs with specifically desired attributes, such as high aesthetic quality. Comparable situations arise in drug discovery, where generation must be guided toward molecules with strong bioactivity. These tasks can be formulated as reinforcement learning (RL) problems, where the diffusion model is fine-tuned to maximize a reward function encoding target properties or user preferences. However, RL-based methods typically require substantial preference data to learn accurate reward models, making them impractical for settings like personalized recommendation systems, where user feedback is limited and expensive to collect interactively.
The challenge is twofold. First, achieving this objective requires efficient exploration. In high-dimensional spaces, such as those of natural images, exploration goes beyond discovering new regions: it must also respect the structural constraints of the problem. For instance, in product recommendation, valid solutions (such as realistic-looking products) are typically confined to a lower-dimensional manifold within a much larger design space. An effective, feedback-efficient fine-tuning method must therefore explore this space while staying within the feasible region, since venturing outside it wastes queries on invalid samples. Moreover, fine-tuning the diffusion model to aggressively optimize for the preferences collected so far can reduce sample diversity. Human preferences are often multimodal, and a model that focuses too narrowly on a small set of preferences may fail to capture their full spectrum, producing less varied samples. Efficient exploration is thus crucial both to the quality of generated samples and to their diversity.
Second, a key challenge in many applications is the high cost of acquiring feedback from the ground-truth reward function. For instance, in a product recommendation system, determining user preferences requires subjective human judgment, which is both costly and time-consuming. This challenge is compounded by the need for the model not only to explore new options but also to exploit the information it has gathered to generate samples that align with the user's preferences. If the model continues to explore without producing samples that meet the user's expectations, it risks disengaging the user. In short, the model must balance exploration and exploitation, generating preference-aligned samples while minimizing costly reward queries. While several recent works have proposed RL-based fine-tuning methods for diffusion models, none directly tackle the challenge of feedback efficiency in an online setting. Uehara et al. introduced a framework that accounts for the online nature of feedback but still relies on a separate parameterized reward model for optimization. Our goal is instead to develop a feedback-efficient online fine-tuning approach that eliminates the need for a separate pre-trained reward model and directly leverages inference-time scaling of a base diffusion model using real-time user feedback.
Framework Architecture
Our framework integrates online preference signals directly into the latent generative process through stochastic dynamics, rethinking alignment by decoupling the generative prior from the value-alignment likelihood. The architecture consists of several interconnected components working together to achieve feedback-efficient preference alignment.
System Components
Base Generative Model
Stable Diffusion 1.5 (SD15) serves as the pre-trained diffusion model. The model operates in a latent space with shape (4, 64, 64) using a VAE scale factor of 0.18215. The UNet provides score functions for the reverse diffusion process, and we use the EulerAncestralDiscreteScheduler for timestep management.
Feynman-Kac Corrector (FKC)
The core alignment mechanism that modifies the reverse diffusion process. At each timestep t, it computes a modified drift term that incorporates reward gradients and diversity terms, enabling real-time steering without model retraining.
Surrogate Reward Model
A lightweight neural network (LatentSurrogate) that learns to approximate user preferences from sparse feedback. Architecture: Conv2d(4→128) → AdaptiveAvgPool2d → MLP(128→256→128→1). Trained online using Adam optimizer (lr=1e-3) on historical preference data.
Preference Proxy
OpenCLIP (ViT-L-14) provides preference scoring by computing image-text similarity. Supports multi-user preferences through a prompt bank system. Uses temperature-scaled softmax (temperature=10.0) to convert similarities to preference scores.
Sequential Monte Carlo
Maintains a particle ensemble with importance weights w. Particles are tracked through the diffusion process, with weights updated based on reward signals and drift interactions. Historical particles are maintained for diversity computation.
Diversity Mechanism
Computes normalized L2 squared distance gradients between particles to prevent mode collapse. Can incorporate historical particles for long-term diversity. Controlled by a binary flag and gamma schedule parameter.
Architecture Flow
Generate initial latents from prompt-conditioned SD15, add noise to match initial timestep
Detailed Algorithm
Standard Denoising Step
For each timestep t, compute the baseline score function score = -ε/σ_t, where ε is the UNet's noise prediction. Perform a standard denoising step to obtain a clean latent estimate z_t_clean using Tweedie's formula: z_t_clean = z_t + σ_t² · score. This clean estimate is used for reward computation because it matches the reward model's training distribution.
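The denoising step above can be illustrated with a minimal sketch. It assumes a sigma-space (variance-exploding) parameterization in which z_t = z_0 + σ_t · ε, which is our reading of the Euler-type scheduler rather than something the text states. The function name and the flat-list representation of a latent are hypothetical and chosen only for illustration.

```python
def tweedie_clean_estimate(z_t, eps_pred, sigma_t):
    """Return (score, z_clean) for one latent stored as a flat list of floats.

    ASSUMPTION: sigma-space parameterization, z_t = z_0 + sigma_t * eps.
    """
    # score = -eps / sigma_t
    score = [-e / sigma_t for e in eps_pred]
    # Tweedie's formula: z_clean = z_t + sigma_t^2 * score
    z_clean = [z + sigma_t ** 2 * s for z, s in zip(z_t, score)]
    return score, z_clean


# Toy check: noise a known clean latent, then recover it from the true noise.
z0 = [0.5, -1.0, 2.0]
eps = [0.1, -0.3, 0.7]
sigma = 4.0
z_t = [z + sigma * e for z, e in zip(z0, eps)]
score, z_clean = tweedie_clean_estimate(z_t, eps, sigma)
```

With an exact noise prediction the estimate coincides with the clean latent; in practice the UNet prediction is approximate, so z_clean is only an estimate.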
Reward Gradient Computation
Compute reward gradients on the clean latent: r_grad = ∇_z r(z_t_clean) where r is the surrogate reward model. The gradient is computed via automatic differentiation, enabling end-to-end gradient flow through the reward network.
Diversity Gradient Computation
Compute diversity loss gradient: div_grad = ∇_z L_div(z_t_clean, historical_particles) where L_div is the normalized L2 squared distance loss. The gradient encourages particles to maintain distance from each other and historical samples, preventing mode collapse. The combined gradient is: combined_r_grad = r_grad + γ_t · div_grad.
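As a rough illustration of the diversity term, the sketch below evaluates the normalized squared-L2 loss for one particle and its closed-form gradient, then forms the combined gradient. The function names are hypothetical, latents are flattened to plain lists, and passing historical particles through the same `others` argument as current particles is our assumption. The actual implementation computes these gradients by automatic differentiation.

```python
def diversity_loss(z_i, others):
    """L_div(z_i) = (1/d) * (1/m) * sum_j ||z_i - z_j||^2, with m = len(others)."""
    d = len(z_i)
    m = len(others)
    total = sum(sum((a - b) ** 2 for a, b in zip(z_i, z_j)) for z_j in others)
    return total / (d * m)


def diversity_grad(z_i, others):
    """Closed-form gradient of diversity_loss with respect to z_i."""
    d = len(z_i)
    m = len(others)
    scale = 2.0 / (d * m)
    return [scale * sum(z_i[k] - z_j[k] for z_j in others) for k in range(d)]


def combined_grad(r_grad, div_grad, gamma_t):
    """combined_r_grad = r_grad + gamma_t * div_grad."""
    return [r + gamma_t * g for r, g in zip(r_grad, div_grad)]


z_i = [0.0, 1.0]
others = [[1.0, 1.0], [0.0, 3.0]]  # other particles (and, optionally, history)
loss = diversity_loss(z_i, others)
grad = diversity_grad(z_i, others)
combined = combined_grad([1.0, 0.0], grad, gamma_t=0.05)
```

Following the gradient increases the mean squared distance to the other particles, which is the separating effect described above.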
FKC Drift Construction
Construct the FKC-modified drift term on the original noisy latent z_t:
drift_fkc = σ_t² · (score + (β_t/2) · combined_r_grad) - f_t
where f_t is the baseline drift derived from the scheduler and combined_r_grad = r_grad + γ_t · div_grad. Convert to noise prediction format, noise_pred_fkc = -drift_fkc / σ_t, and apply via the scheduler step.
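A minimal sketch of this construction follows, using the drift expression listed under Mathematical Foundation, σ_t²(score + (β_t/2)(∇r + γ_t·∇div)) - f_t. The function names and the flat-list latent representation are hypothetical. The toy values, including f_t = 0, are chosen only to make the arithmetic easy to follow.

```python
def fkc_drift(score, combined_r_grad, f_t, sigma_t, beta_t):
    """drift = sigma_t^2 * (score + (beta_t / 2) * combined_r_grad) - f_t."""
    return [
        sigma_t ** 2 * (s + 0.5 * beta_t * g) - f
        for s, g, f in zip(score, combined_r_grad, f_t)
    ]


def drift_to_noise_pred(drift, sigma_t):
    """noise_pred_fkc = -drift_fkc / sigma_t."""
    return [-d / sigma_t for d in drift]


sigma = 2.0
eps = [0.2, -0.4]
score = [-e / sigma for e in eps]  # score = -eps / sigma_t
zero = [0.0, 0.0]

# Without guidance (beta_t = 0) and with f_t = 0, the conversion returns eps.
plain = drift_to_noise_pred(fkc_drift(score, zero, zero, sigma, beta_t=0.0), sigma)

# With guidance, the prediction shifts along the combined reward gradient.
guided = drift_to_noise_pred(fkc_drift(score, [1.0, 0.0], zero, sigma, beta_t=1.0), sigma)
```

The unguided case is a useful sanity check: if the conversion does not reproduce the UNet's noise prediction when β_t = 0 and f_t = 0, the sign or scaling convention is inconsistent with the scheduler.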
Weight Update
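Update particle weights using the Feynman-Kac weight equation:
dw = (∂β_t/∂t) · r · dt - ⟨β_t · combined_r_grad, f_t⟩ · dt + ⟨β_t · combined_r_grad, (σ_t²/2) · score⟩ · dt
where combined_r_grad = r_grad + γ_t · div_grad. Weights are clamped to [-100, 100] during updates, then normalized to [0, 1] at the end of simulation.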
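The weight update can be sketched as below, following the three terms of the weight-evolution expression listed under Mathematical Foundation. Two points are assumptions on our part and are not stated in the text: the accumulated quantities are treated as log-weights, and the final normalization to [0, 1] is done with a softmax. A min-max rescaling would be another reading. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weight_increment(r, g, f_t, score, sigma_t, beta_t, beta_dot, dt):
    """dw = beta_dot*r*dt - <beta_t*g, f_t>*dt + <beta_t*g, (sigma_t^2/2)*score>*dt."""
    bg = [beta_t * x for x in g]
    half_sigma2_score = [0.5 * sigma_t ** 2 * s for s in score]
    return (beta_dot * r - dot(bg, f_t) + dot(bg, half_sigma2_score)) * dt


def clamp(w, lo=-100.0, hi=100.0):
    """Clamp a running weight to [-100, 100], as described above."""
    return max(lo, min(hi, w))


def normalize_weights(log_w):
    """ASSUMPTION: softmax over log-weights; the source only says 'normalized to [0, 1]'."""
    m = max(log_w)
    exps = [math.exp(w - m) for w in log_w]
    total = sum(exps)
    return [e / total for e in exps]


dw = weight_increment(r=2.0, g=[1.0, 0.0], f_t=[0.5, 0.0], score=[-0.1, 0.2],
                      sigma_t=2.0, beta_t=1.0, beta_dot=1.0, dt=0.1)
running = [clamp(0.0 + dw), clamp(0.0), clamp(250.0)]
weights = normalize_weights(running)
```

Subtracting the maximum before exponentiating keeps the softmax numerically stable even when a clamped weight sits at the upper bound.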
Feedback Collection & Model Update
After FKC simulation, decode latents to images, score using OpenCLIP, and update the surrogate reward model. Training uses batched Adam optimization (batch_size=32) over all historical data for 200 epochs per iteration.
Experimental Setup
Hardware & Device
- Device: CUDA-enabled GPU (automatic fallback to CPU)
- Memory Optimization: Attention slicing, VAE slicing, VAE tiling enabled
- Batch Processing: Configurable batch sizes for memory efficiency (default: 8 for generation, 32 for training)
Model Configuration
- Base Model: Stable Diffusion 1.5 (runwayml/stable-diffusion-v1-5)
- Scheduler: EulerAncestralDiscreteScheduler
- Latent Shape: (4, 64, 64) channels × height × width
- VAE Scale Factor: 0.18215
- Precision: FP16 on CUDA, FP32 on CPU
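For orientation, the latent shape and scale factor above relate to pixel space as in the following sketch. The 8× spatial downsampling of the SD 1.5 VAE and the 512×512 image size are not stated in the list and are our assumptions; they are consistent with the (4, 64, 64) latent shape. The helper names are hypothetical.

```python
VAE_SCALE_FACTOR = 0.18215
LATENT_CHANNELS = 4
VAE_DOWNSAMPLE = 8  # ASSUMPTION: spatial downsampling factor of the SD 1.5 VAE


def latent_shape(height, width):
    """Latent shape (channels, height, width) for an image of the given pixel size."""
    return (LATENT_CHANNELS, height // VAE_DOWNSAMPLE, width // VAE_DOWNSAMPLE)


def unscale_for_decode(latent_values):
    """Undo the VAE scale factor before passing latents to the VAE decoder."""
    return [v / VAE_SCALE_FACTOR for v in latent_values]


shape = latent_shape(512, 512)
unscaled = unscale_for_decode([0.18215, -0.3643])
```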
Hyperparameters
- n_particles: Number of particles in ensemble (default: 32)
- n_steps: Diffusion timesteps (default: 25)
- k_observe: Number of particles to observe per iteration (default: 8)
- B: Total feedback budget (default: 80)
- temperature: OpenCLIP softmax temperature (default: 10.0)
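One way to keep these defaults together is a small configuration object, sketched below. The class name and the `feedback_rounds` helper are hypothetical. The helper assumes that each iteration spends exactly k_observe queries from the budget B, which the list above suggests but does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FKCConfig:
    n_particles: int = 32     # particles in the ensemble
    n_steps: int = 25         # diffusion timesteps
    k_observe: int = 8        # particles observed per iteration
    B: int = 80               # total feedback budget
    temperature: float = 10.0 # OpenCLIP softmax temperature

    def feedback_rounds(self) -> int:
        """ASSUMPTION: every iteration consumes exactly k_observe queries."""
        return self.B // self.k_observe


config = FKCConfig()
rounds = config.feedback_rounds()
```

Under that assumption the default budget supports ten rounds of feedback, with a quarter of the ensemble observed in each round.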
Schedule Parameters
- β schedule: Linear from β_min=0.5 to β_max=2.0 over total steps
- γ schedule: Linear from γ_max=0.05 to γ_min=0.0 over total steps
- β_dot: Rate of change of beta (default: 1.0)
- diversity_enabled: Binary flag (default: False in image experiments)
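The two linear schedules can be sketched as follows. The helper name is hypothetical, and we assume both endpoints are included and that index 0 corresponds to the first denoising step. The time variable against which β_dot is defined is not specified above, so it is not derived here.

```python
def linear_schedule(start, end, n_steps):
    """Linearly interpolate from start to end over n_steps values (endpoints included)."""
    if n_steps == 1:
        return [start]
    return [start + (end - start) * i / (n_steps - 1) for i in range(n_steps)]


n_steps = 25
betas = linear_schedule(0.5, 2.0, n_steps)    # beta_min -> beta_max
gammas = linear_schedule(0.05, 0.0, n_steps)  # gamma_max -> gamma_min
```

With these values the reward tilt strengthens as denoising proceeds, while the diversity weight decays to zero by the final step.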
Training Configuration
- Optimizer: Adam with lr=1e-3
- Training Epochs: 200 per iteration
- Batch Size: 32 for reward model training
- Loss Function: Mean squared error between predicted and observed rewards
Preference System
- CLIP Model: OpenCLIP ViT-L-14 (openai pretrained)
- Scoring: Temperature-scaled softmax over prompt bank similarities
- Multi-user Support: Union scoring (max over user preferences)
- Default Users: apple, grape, banana preferences
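The scoring rule above can be illustrated with a small sketch that starts from precomputed image-text similarities. Several details are assumptions: the prompt bank holds one prompt per default user, the temperature multiplies the similarities before the softmax (dividing is the other common convention), and the similarity values are invented for illustration. The function names are hypothetical.

```python
import math


def softmax(xs, temperature):
    """ASSUMPTION: temperature multiplies the similarities before the softmax."""
    scaled = [temperature * x for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def union_score(similarities, user_indices, temperature=10.0):
    """Union scoring: the maximum preference score over the given users."""
    probs = softmax(similarities, temperature)
    return max(probs[i] for i in user_indices)


prompt_bank = ["apple", "grape", "banana"]  # one prompt per default user (assumed)
similarities = [0.30, 0.20, 0.25]           # illustrative cosine similarities
probs = softmax(similarities, temperature=10.0)
score_all_users = union_score(similarities, user_indices=[0, 1, 2])
score_two_users = union_score(similarities, user_indices=[1, 2])
```

Because the score is a maximum over users, an image is rated highly as soon as it matches any one user's preference.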
Optimal Transport Problem
The framework shifts the pre-trained distribution toward user preferences while minimizing distortion of the generative manifold. It provides principled control over the transport map, enabling fine-grained steering without retraining. The FKC drift term directly implements the modification of the transport map.
Training-Free Alignment
The framework eliminates the need for a separate pre-trained reward model. The surrogate model learns online from sparse feedback, while the base diffusion model remains frozen. Alignment occurs entirely at inference time through FKC modifications to the reverse diffusion process.
Diversity Preservation
The framework maintains geometric coherence and semantic meaning, avoiding mode collapse and reward hacking. The diversity gradient term encourages particle separation in latent space, while historical particle tracking supports long-term diversity. The method converges robustly to high-reward regions while preserving distributional diversity.
Mathematical Foundation
The framework solves the alignment problem through the Feynman-Kac formalism:
- Decoupling: Generative prior p(z) (from SD15) is separated from value-alignment likelihood L(r|z) (from surrogate model)
- FKC Drift: dz = [σ_t²(score + (β_t/2)(∇r + γ_t·∇div)) - f_t] dt + σ_t dW_t
- Weight Evolution: dw = (∂β_t/∂t)·r·dt - ⟨β_t·(∇r + γ_t·∇div), f_t⟩·dt + ⟨β_t·(∇r + γ_t·∇div), (σ_t²/2)·score⟩·dt
- Sequential Monte Carlo: Particles z_i with weights w_i are evolved through the SDE, with resampling implicit in weight updates
- Diversity Loss: L_div(z_i) = (1/d)·(1/(n-1))·∑_{j≠i} ||z_i - z_j||² normalized by latent dimension d
- Reward Approximation: Surrogate model r_θ(z) learns from historical feedback pairs (z, r_true) via MSE loss