OneReward: Multi-Task Visual RLHF

TL;DR

OneReward is a reinforcement learning from human feedback (RLHF) framework for image models that uses a single, powerful vision-language reward model to guide many different image editing tasks, delivering more consistent quality across tasks than task-specific fine-tuning approaches.

ELI5 Introduction

Imagine you are coloring a picture and an adult sits next to you and says which version looks nicer, which version keeps the shapes correct, and which version has the words in the right place.

Over time you learn what makes a picture good in many different ways at once and start to color better without needing the adult every time.

Modern image models do something similar. They try to fill missing parts of an image, extend an image, remove objects, or draw text inside pictures, and they learn from human preferences about which result looks better.

OneReward is like having one very smart art teacher who can judge all these tasks at once, instead of having a different teacher for each job. This single teacher helps the model improve faster, stay more consistent, and deliver outputs that people actually like in real creative and commercial workflows.

What OneReward Actually Is

OneReward is a reinforcement learning from human feedback (RLHF) framework designed for the visual domain, focused on mask-guided image generation and editing tasks such as image fill, image extend, object removal, and text rendering.

Instead of training separate reward models and policies for each task, OneReward uses a single vision language model as a generative reward model that can evaluate images across many tasks and quality dimensions.

At a high level, OneReward introduces three key ideas for visual generative systems:

  • A unified reward model that understands many tasks
  • A multi-task training loop that optimizes one policy across those tasks
  • A practical implementation that has already produced a strong editing model called Seedream 3 Fill and improved FLUX Fill style checkpoints

This makes OneReward strategically important for teams building image generation products, especially those that must support several editing modes with consistent brand quality.

How Reinforcement Learning From Human Feedback Works in Images

Reinforcement learning from human feedback aligns an AI system with human preferences by learning a reward model from human comparison data and then optimizing the generation policy to maximize that learned reward.

In language, this process is used to make chat models more helpful and safe. In the visual domain, it is used to make generated images more usable, aesthetic, and faithful to prompts.

The core steps are conceptually simple:

  1. Generate multiple candidate outputs for the same input
  2. Ask human annotators which one they prefer along different quality dimensions
  3. Train a reward model that predicts which option people will pick
  4. Train the generator to produce outputs that score higher on that learned reward
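The four steps above can be sketched end to end with a toy one-dimensional "quality feature" standing in for an image. All function names here are illustrative, not from the OneReward codebase; the reward is a linear score fit on pairwise comparisons with a Bradley-Terry objective.

```python
import math
import random

def generate_candidates(rng, n=4):
    """Step 1: sample several candidate outputs for one input (toy feature)."""
    return [rng.random() for _ in range(n)]

def human_prefers(a, b):
    """Step 2: simulated annotator who prefers the higher-quality candidate."""
    return a > b

def bt_prob(w, a, b):
    """Bradley-Terry probability that candidate a beats candidate b
    under a linear reward r(x) = w * x."""
    return 1.0 / (1.0 + math.exp(-(w * a - w * b)))

def train_reward(pairs, lr=1.0, steps=200):
    """Step 3: fit the reward weight on pairwise comparisons by
    gradient ascent on the Bradley-Terry log-likelihood."""
    w = 0.0
    for _ in range(steps):
        for a, b, pref in pairs:
            p = bt_prob(w, a, b)
            w += lr * (pref - p) * (a - b)
    return w

rng = random.Random(0)
pairs = []
for _ in range(50):
    a, b = generate_candidates(rng, n=2)
    pairs.append((a, b, 1.0 if human_prefers(a, b) else 0.0))

w = train_reward(pairs)
# Step 4 would then optimize the generator to increase w * x for its outputs.
```

Because the simulated annotator always prefers the higher feature value, the learned weight comes out positive, so the reward ranks better candidates higher.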

OneReward applies this pattern to mask-guided image generation and extends it in two ways. It uses a single vision language reward model for many tasks and many quality criteria, and it introduces a multi-task reinforcement learning scheme that lets one policy improve across all of them at once.

Implementation Strategies

Decide where OneReward-style RLHF fits in your stack

Before designing a OneReward-inspired system, clarify its role in your architecture:

  • Core image engine for all mask-guided editing in a design tool
  • Specialized engine for e-commerce product imagery and template-based creatives
  • Back-end quality layer that reranks or refines outputs from existing diffusion models

Teams already using supervised fine-tuning or LoRA-based approaches can view OneReward-style RLHF as a next-stage upgrade when they hit a ceiling in quality or consistency.

Design your task and evaluation taxonomy

A critical upstream decision is the definition of tasks and evaluation dimensions.

Typical task categories include image fill, extend, object removal, background replacement, and text rendering in images. Quality dimensions that map well to human perception include aesthetics, structural consistency, prompt alignment, text alignment, and artifact suppression.

You should define a schema that can be encoded into the structured query used by the reward model, and ensure that this schema matches the way your annotators think about quality and the way your product owners define success.
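One minimal way to make that schema explicit is a small validated record type. The task and dimension names below follow this article's taxonomy; the structure itself is an assumption, not an official OneReward data format.

```python
from dataclasses import dataclass

# Illustrative taxonomy; extend these tuples as your product adds tasks.
TASKS = ("image_fill", "image_extend", "object_removal",
         "background_replacement", "text_rendering")
DIMENSIONS = ("aesthetics", "structural_consistency", "prompt_alignment",
              "text_alignment", "artifact_suppression")

@dataclass(frozen=True)
class EvalQuery:
    task: str
    dimension: str
    prompt: str = ""  # include only when needed for the judgment

    def __post_init__(self):
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")

q = EvalQuery(task="image_fill", dimension="aesthetics", prompt="a red sofa")
```

Validating at construction time keeps annotation tools, reward-model queries, and product dashboards speaking the same vocabulary.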

Build a scalable preference data pipeline

OneReward relies on pairwise comparisons with multiple candidates per input. For an applied system, you need a data pipeline that can generate candidate images under controlled variations, collect human selections, and store richly structured metadata.

Key practical elements include:

  • Clear annotation interfaces that show candidates side by side and capture best and worst choices
  • Randomization of candidate ordering to reduce bias
  • Logging of parameters such as steps, noise strength, and guidance scale for later analysis
  • A process to update sampling strategies as the policy improves, to keep the learning signal dense

If you already have production telemetry on which creatives users accept, publish, or edit further, you can supplement curated preference labeling with implicit feedback.
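The pipeline elements above can be captured in a single logging record per comparison. Field names here are assumptions chosen to match the list above (randomized ordering, best/worst choices, sampler parameters), written as one JSON line per comparison for later analysis.

```python
import json
import random

def make_record(input_id, candidates, best, worst, rng):
    """Hypothetical log record for one pairwise/best-worst comparison."""
    order = list(range(len(candidates)))
    rng.shuffle(order)  # randomize display order to reduce position bias
    return {
        "input_id": input_id,
        "display_order": order,
        "best": best,
        "worst": worst,
        # Sampler parameters logged for later analysis; values illustrative.
        "sampler_params": {"steps": 30, "noise_strength": 0.8,
                           "guidance_scale": 5.0},
    }

rec = make_record("job-001", ["cand_a", "cand_b", "cand_c"],
                  best="cand_b", worst="cand_c", rng=random.Random(7))
line = json.dumps(rec)
```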

Integrate the unified reward model

Implementation-wise, OneReward uses a vision language model as the reward backbone, which simplifies cross-task generalization. You can adopt a similar pattern by starting from a capable multimodal foundation model and fine-tuning it on your own comparison data to predict pairwise preferences.

When designing the input format, keep the structured query idea:

  • Explicit tokens or phrases that indicate the task type
  • Clear wording for each evaluation dimension
  • Inclusion or omission of the original prompt depending on whether it is needed for the decision

This reduces the need to train separate reward heads and enables future extension to new tasks with minimal changes.
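A structured query in this spirit can be assembled from the task, dimension, and optional prompt. The template wording below is invented for illustration; the point is that one text interface covers every task, so adding a new task means adding a name, not a new reward head.

```python
def build_query(task, dimension, prompt=None):
    """Build a text query for a vision-language reward model (sketch)."""
    parts = [
        f"<task:{task}>",                                   # explicit task token
        f"Evaluate the {dimension.replace('_', ' ')}.",     # dimension wording
    ]
    if prompt is not None:                                  # include only if needed
        parts.append(f'Original prompt: "{prompt}"')
    parts.append("Which image is better, Image A or Image B?")
    return " ".join(parts)

q = build_query("object_removal", "structural_consistency")
```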

Optimize the policy with guardrails

In the OneReward framework, the policy is optimized against the reward model while a reference model and a reward upper bound act as stabilizers.

In your implementation, carry over the following ideas:

  • Use a stable reference baseline rather than updating everything at once
  • Cap the effective reward so the policy cannot exploit degenerate behavior
  • Monitor reward scores alongside external metrics such as human evaluation and business KPIs

You can also use an exponential moving average of the policy as a dynamic reference, a technique the OneReward paper proposes to reduce memory usage while still maintaining a strong baseline.

Best Practices And Case Examples

Best practices drawn from OneReward

Several practices from OneReward translate directly into operational guidelines:

  • Treat tasks and dimensions as first-class objects, not ad hoc tags
  • Use multi-dimensional labels so you can tune for tradeoffs rather than a single metric
  • Train the reward model on pairwise data, which is more stable and easier for humans to provide than absolute scores
  • Keep the reward model relatively general and factor task-specific logic into the query

Another best practice is to integrate evaluation deeply into the training loop. In OneReward, balanced improvement across dimensions is achieved by averaging rewards and enforcing loss structures that discourage over-optimization around a single aspect like aesthetics.
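The balanced-improvement idea reduces, in its simplest form, to averaging per-dimension scores so no single aspect dominates the training signal. Dimension names follow the article's taxonomy; the scores are placeholders.

```python
def balanced_reward(dim_scores):
    """Average reward across quality dimensions (scores assumed in [0, 1]),
    so over-optimizing one aspect such as aesthetics cannot dominate."""
    return sum(dim_scores.values()) / len(dim_scores)

r = balanced_reward({"aesthetics": 0.9,
                     "structural_consistency": 0.4,
                     "prompt_alignment": 0.7})
```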

Seedream Fill as a reference point

Seedream Fill is a mask-guided generation model trained on top of a Seedream base model using the OneReward reinforcement learning framework. It is capable of handling diverse tasks in a unified way, including image fill, extend, object removal, and text rendering, and it has been shown to outperform both commercial and open-source baselines on multiple evaluation metrics.

For product teams, this shows that a unified RLHF approach is not just a research curiosity. It can directly power a production-grade editing engine that competes with established tools used by designers and marketers.

FLUX Fill and ecosystem integration

The authors also open-sourced an improved FLUX Fill variant trained with OneReward, and there are community discussions around adding support for OneReward-based models in ComfyUI.

This signals an emerging ecosystem where unified reward models become reusable components across toolchains rather than isolated research assets. If your organization is building workflows on top of diffusion engines, this kind of integration suggests an adoption path where you plug in a OneReward-inspired reward model and policy into existing UIs and pipelines.

Actionable Next Steps For Product And Engineering Leaders

Step one: Define your strategic use cases

Start with a short list of business-critical visual tasks where consistent quality matters most. Examples include e-commerce product imagery, marketing banners with text overlays, social media templates, or document illustrations with strict layout constraints.

For each, specify whether the core operations are fill, extend, removal, or text rendering, and define the main quality dimensions you care about.

Step two: Stand up a small-scale preference labeling program

Instead of waiting for a perfect dataset, begin with a focused pilot where you collect pairwise preferences for a few hundred or a few thousand inputs across your priority tasks. Use your current best model to generate candidate images and design simple best-and-worst selection tasks for internal reviewers or trusted partners.

This gives you an initial dataset to fine-tune a vision language reward model and to test the effect of reward-guided training on real business cases.

Step three: Prototype a unified reward model

Select a strong base vision language model and fine-tune it on your pilot data following a pairwise Bradley-Terry-style objective similar to the one used in OneReward. Ensure your implementation supports structured queries that encode task and dimension so you can extend the system easily as new use cases appear.

Evaluate the reward model offline by checking how often it agrees with held-out human judgments and by running small ablation studies that vary queries and inputs.
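The offline agreement check can be sketched as follows, with reward-model scores as placeholder floats and the Bradley-Terry win probability deciding which candidate the model "picks".

```python
import math

def bt_prob(score_a, score_b):
    """Bradley-Terry probability that candidate A beats candidate B."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

def agreement_rate(examples):
    """examples: (score_a, score_b, human_picked_a) triples from held-out data."""
    hits = sum((bt_prob(a, b) > 0.5) == picked_a
               for a, b, picked_a in examples)
    return hits / len(examples)

held_out = [(1.2, 0.3, True), (0.1, 0.9, False), (0.5, 0.6, True)]
rate = agreement_rate(held_out)  # model agrees on 2 of 3 judgments
```

Tracking this rate per task and per dimension (rather than one global number) is what makes the ablation studies mentioned above actionable.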

Step four: Integrate reinforcement learning into your model training

Once the reward model is stable, introduce a reinforcement learning loop that optimizes your diffusion or other generative model against this reward while keeping a reference baseline. Apply the loop across all included tasks according to a sampling schedule that reflects their relative priority in your product roadmap.

Monitor not only the reward but also downstream metrics such as editor acceptance rate, time to completion, or manual retouching time.
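A priority-weighted sampling schedule for the multi-task loop can be as simple as weighted random choice over tasks. The weights below are illustrative product priorities, not values from the paper.

```python
import random

# Hypothetical roadmap priorities; must sum to 1.0 for readability only.
TASK_WEIGHTS = {"image_fill": 0.4, "image_extend": 0.2,
                "object_removal": 0.3, "text_rendering": 0.1}

def sample_task(rng):
    """Draw the next training task in proportion to its priority weight."""
    tasks, weights = zip(*TASK_WEIGHTS.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {t: 0 for t in TASK_WEIGHTS}
for _ in range(1000):
    counts[sample_task(rng)] += 1  # empirical mix tracks the weights
```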

Step five: Build governance and iteration routines

Because human preferences and brand standards evolve, treat the reward model as a living asset:

  • Schedule regular rounds of new preference data collection
  • Review evaluation prompts and dimensions with designers and marketers
  • Audit for unintended biases in which styles or demographics get higher scores

This keeps the system aligned with changing market expectations and internal guidelines.

Conclusion

OneReward demonstrates that a single, well-designed reward model can coordinate multi-task reinforcement learning for complex visual editing workflows and deliver step-change gains in usability and perceived quality compared with task-specific tuning alone.

For organizations that rely on image generation for marketing, product display, or creative tooling, this offers a path to consolidate fragmented model efforts into a unified reinforcement learning from human feedback system that is easier to scale, govern, and iterate.

By defining a precise task and quality taxonomy, investing in pairwise preference data, deploying a structured vision language reward model, and embedding reinforcement learning into your training stack, you can move from generic diffusion models to a differentiated, brand-aligned visual engine grounded in human judgment rather than purely synthetic metrics.
