The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL
Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano
Published Jun 17, 2026, 15:01 UTC
Problem
The paper targets score- and flow-matching models, which rely on preference-based reinforcement learning (RL) both to align with subjective preferences and to recover visual realism and coherent object structure. The authors argue that the underlying matching losses, which measure $\ell_2$ regression error on velocity or score fields, align poorly with the visual and semantic properties that determine sample quality at inference. This structural mismatch, they say, degrades the quality of generated samples. The work is a preprint and has not yet undergone peer review.
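To make the "$\ell_2$ regression error on velocity fields" concrete, here is a minimal sketch of a flow-matching loss. It assumes a linear noise-to-data path, which is one common choice and not necessarily the one used for every model in the paper; `flow_matching_loss` and `zero_model` are hypothetical names introduced for illustration.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared l2 error between predicted and target velocity.

    Assumption: linear path x_t = (1 - t) * x0 + t * x1, whose target
    velocity is x1 - x0. Here x0 is noise, x1 is data, t lies in [0, 1].
    """
    t = t[:, None]                      # broadcast time over feature dims
    xt = (1.0 - t) * x0 + t * x1        # point on the noise-to-data path
    target = x1 - x0                    # velocity of the linear path
    pred = velocity_fn(xt, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

# Toy check: a stand-in model that always predicts zero velocity.
x0 = np.zeros((4, 3))
x1 = np.ones((4, 3))
t = np.array([0.0, 0.25, 0.5, 1.0])
zero_model = lambda xt, t: np.zeros_like(xt)
loss = flow_matching_loss(zero_model, x0, x1, t)  # 3.0: one unit of error per dimension
```

The loss weights every coordinate of the velocity error equally, regardless of how visible that error would be in the final sample.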
Method
The authors propose Discriminator-Guided RL (DRL). A discriminator is trained, within a pretrained representation space, to distinguish real data from samples generated by a base model. Its logit then serves as the reward in a KL-regularized RL framework. Working in a pretrained representation space keeps the discriminator focused on perceptually meaningful directions, and its logit estimates the log-likelihood ratio between real data and model samples, which the authors describe as an optimal reward for targeting the data distribution. Because the reward is built from existing pretrained models, it requires no expensive human preference annotations.
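The link between a discriminator logit and a log-likelihood ratio can be checked on a toy problem. The sketch below is not the authors' training procedure: it assumes 1-D Gaussian "features" in place of a pretrained representation, a logistic-regression discriminator, and the standard closed-form optimum of KL-regularized reward maximization (base distribution reweighted by `exp(reward / beta)`) in place of an RL algorithm. All function names are hypothetical.

```python
import numpy as np

def train_logistic_discriminator(real, fake, steps=2000, lr=0.5):
    """Fit logit(x) = w.x + b by gradient descent on binary cross-entropy
    (label 1 for real features, 0 for base-model features)."""
    x = np.concatenate([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted P(real | x)
        grad = p - y                             # dLoss/dlogit per sample
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def reward(features, w, b):
    """Discriminator logit, used directly as the reward."""
    return features @ w + b

def kl_regularized_weights(rewards, beta=1.0):
    """Self-normalized weights proportional to exp(reward / beta): the
    closed-form optimum of reward maximization with a KL penalty."""
    z = rewards / beta
    z = z - z.max()                              # numerical stability
    weights = np.exp(z)
    return weights / weights.sum()

rng = np.random.default_rng(0)
real = rng.normal(loc=1.0, scale=1.0, size=(5000, 1))   # stand-in "data" features
fake = rng.normal(loc=0.0, scale=1.0, size=(5000, 1))   # stand-in "base model" features

w, b = train_logistic_discriminator(real, fake)
# For N(1, 1) vs N(0, 1) the true log ratio is x - 0.5, so w should be
# close to 1 and b close to -0.5.

weights = kl_regularized_weights(reward(fake, w, b), beta=1.0)
tilted_mean = float(np.sum(weights * fake[:, 0]))
# With beta = 1 and an exact log-ratio reward, the reweighted base
# distribution equals the data distribution, so the mean moves toward 1.
```

In this toy setting, reweighting base samples by the exponentiated logit moves their mean from roughly 0 toward the data mean of 1, which illustrates why a log-ratio reward targets the data distribution.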
Results
DRL improves results across several base models: SiT, JiT, REPA, and RAE. On SiT, it reduces guidance-free Fréchet Inception Distance (FID) from 9.38 to 2.62 and semantic-space Fréchet Distance (FD), measured with DINOv3, from 88.2 to 19.3. The gains are consistent across backbone architectures. DRL also raises human-preference rewards without training on preference data. When preference-based post-training is applied afterwards, DRL improves the Pareto frontier between preference reward and image fidelity and reduces low-level artifacts such as oversaturation and excessive brightness.
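FID and the semantic-space FD share the same formula and differ only in the feature extractor. The sketch below computes the Fréchet distance between Gaussians fitted to two feature sets. The feature extractor is omitted and random arrays stand in for its output; `frechet_distance` is a hypothetical name, not a function from the paper's code.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of C_a C_b, which are real and non-negative for covariance matrices.
    eig = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
a = rng.normal(size=(500, 4))      # stand-in features for real images
b = a + 2.0                        # same spread, mean shifted by 2 per dim

fd_same = frechet_distance(a, a)   # identical sets: distance near 0
fd_shift = frechet_distance(a, b)  # only the mean term contributes: 4 * 2^2
```

Lower is better for both metrics, so the reported drops correspond to generated-image features moving closer to real-image features in mean and covariance.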
Limitations
The authors acknowledge that DRL may still be limited by the quality of the pretrained representation space and by the discriminator’s capacity to generalize across diverse datasets. The reliance on KL regularization may also add hyperparameter tuning burden. The paper does not address potential scalability issues when applying DRL to larger datasets or more complex models.
Why it matters
DRL matters for generative modeling because it improves sample quality without extensive human annotation. By aligning the reward more closely with perceptual properties, the approach could lead to more robust and realistic generative models for applications such as image synthesis and style transfer. The paper is available on arXiv.
By Callan Zhang · Jun 17, 2026
Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.
Source: arXiv cs.CV