The existing trajectory-based approach computes likelihoods by backtracking through every intermediate stage (noisy state) of the image-generation process.
- Problem: Because every step must be stored and evaluated, memory consumption is enormous. It also restricts training to stochastic SDE (Stochastic Differential Equation) samplers, which keeps computational costs high.
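For contrast, here is a minimal sketch (with hypothetical names, not from the paper) of what a trajectory-based estimator has to do: keep every per-step transition in memory and sum its log-probability.

```python
import torch

def trajectory_log_prob(means, stds, states):
    """Sum per-step Gaussian log-probabilities along a stored denoising
    trajectory. Every intermediate state must be kept in memory, which is
    exactly what makes this approach expensive.

    means, stds: per-step transition parameters predicted by the policy
    states: the states actually sampled at each step (all kept in memory)
    """
    log_prob = 0.0
    for mean, std, x in zip(means, stds, states):
        dist = torch.distributions.Normal(mean, std)
        # log-prob of the step actually taken, summed over pixel dimensions
        log_prob = log_prob + dist.log_prob(x).sum(dim=[1, 2, 3])
    return log_prob
```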
In contrast, the ELBO-based approach proposed by the researchers only requires the final generated image.
- Solution: It approximates probability using the Evidence Lower Bound (ELBO). There's no need to track intermediate steps one by one, saving memory, and any sampler (black-box) can be used.
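In symbols, the bound is the standard diffusion ELBO written in velocity form (a sketch; the exact weighting $w(t)$ depends on the noise schedule, and $C$ collects model-independent constants):

$$\log p_\theta(x_0) \;\ge\; \mathbb{E}_{t \sim U(0,1),\, \epsilon \sim \mathcal{N}(0, I)}\left[ -w(t)\, \lVert v_\theta(x_t, t) - (\epsilon - x_0) \rVert^2 \right] + C, \qquad x_t = \alpha_t x_0 + \sigma_t \epsilon$$

Because the expectation is over a single random $t$ and $\epsilon$, one forward pass per sample suffices; no trajectory is stored.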
Research results show that using the ELBO method dramatically speeds up training compared to existing methods, and the final performance (GenEval score) is also much higher.

Key Finding 2: Complex Techniques Are Not Necessary (Simple is Best)
It was revealed that techniques considered essential in LLM reinforcement learning, such as Clipping and Advantage Normalization, have little effect in diffusion models.
- Clipping: Used in PPO to prevent the model from changing too drastically, but in diffusion, it either hindered training or had no effect.
- CFG (Classifier-Free Guidance): Normally CFG is enabled during training to produce high-quality images, but the researchers confirmed that turning off CFG during training is faster and solves the train-inference mismatch problem.
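The CFG combination rule being switched off can be sketched as follows (the guidance scale is an illustrative value, not taken from the paper); setting the scale to 1 recovers the plain conditional prediction, i.e. "CFG off":

```python
import torch

def cfg_velocity(v_cond, v_uncond, guidance_scale=4.5):
    """Classifier-free guidance mixes conditional and unconditional
    predictions. Training rollouts without CFG simply use v_cond directly,
    which is one model call instead of two and matches the ELBO estimator."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)
```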
In conclusion, the combination of "Accurate probability estimation (ELBO) + Simple loss function (EPG/PEPG) + Fast sampler (ODE)" is the strongest.
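To make the "simple is best" point concrete, here is a side-by-side sketch of the PPO clipped surrogate versus the plain reward-weighted objective (function names and the eps value are illustrative, not from the paper):

```python
import torch

def clipped_ratio_loss(ratio, advantages, eps=0.2):
    """Standard PPO clipped surrogate: the piece the paper finds unnecessary
    (or harmful) for diffusion models. Shown only for comparison."""
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

def simple_pg_loss(ratio, rewards):
    """The simpler objective the paper favors: no clipping, no advantage
    normalization, just reward-weighted importance ratios."""
    return -torch.mean(ratio * rewards)
```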

Implementation Guide for Developers: ELBO-based RL
Below is a brief Python sketch of the paper's core logic: ELBO-based likelihood estimation and the update step. Focus on the intuitive flow rather than the exact formulas.
```python
import torch

def compute_elbo_likelihood(model, x_0, text_embeddings):
    """Estimate the image log-likelihood via the ELBO.

    Args:
        model: diffusion model being trained
        x_0: final generated image (clean image)
        text_embeddings: prompt embeddings
    """
    batch_size = x_0.shape[0]
    # Sample one random timestep per image instead of sweeping the trajectory
    t = torch.rand(batch_size, device=x_0.device)
    epsilon = torch.randn_like(x_0)
    # get_noise_schedule is assumed to return broadcastable alpha_t, sigma_t
    alpha_t, sigma_t = get_noise_schedule(t)
    x_t = alpha_t * x_0 + sigma_t * epsilon
    pred_velocity = model(x_t, t, encoder_hidden_states=text_embeddings)
    target_velocity = epsilon - x_0  # flow-matching velocity target
    w_t = 1.0  # uniform weighting; a schedule-dependent weight can go here
    mse_loss = torch.sum((pred_velocity - target_velocity) ** 2, dim=[1, 2, 3])
    # Lower MSE => higher ELBO => higher estimated log-likelihood
    log_likelihood_estimate = -w_t * mse_loss
    return log_likelihood_estimate

def train_step(model, old_model, optimizer, prompt, text_embeddings):
    # 1. Generate images with the frozen old policy (fast few-step ODE sampler)
    with torch.no_grad():
        generated_images = ode_sampler(old_model, prompt, num_steps=10)
        rewards = compute_reward(generated_images, prompt)
    # 2. Likelihood under the current policy (gradients flow through this call)
    log_prob_current = compute_elbo_likelihood(model, generated_images, text_embeddings)
    with torch.no_grad():
        log_prob_old = compute_elbo_likelihood(old_model, generated_images, text_embeddings)
    # 3. Importance ratio and simple reward-weighted loss (no clipping)
    ratio = torch.exp(log_prob_current - log_prob_old)
    loss = -torch.mean(ratio * rewards)
    # 4. Update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
As the code shows, the likelihood can be estimated from just x_0 (the final image) and a random t, without storing the entire trajectory. This is the secret behind the speed improvement.
Experimental Results: Overwhelming Efficiency
The paper conducted experiments using the SD 3.5 Medium model. The results were remarkable.
- Training Speed: 2x faster than the existing SOTA DiffusionNFT, and 4.6x faster than FlowGRPO. (Achieved GenEval 0.95 in just 90 hours on H100 GPU)
- Performance: GenEval score skyrocketed from 0.24 (base model) to 0.95.
- Sampling Efficiency: Even using ODE sampler (10 steps) during training showed no performance difference compared to using SDE sampler (40 steps). In other words, the same effect is achieved with much less computation.
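A minimal Euler integrator illustrates why 10 ODE steps are cheap: one model call per step and no injected noise. This is a sketch assuming a rectified-flow schedule (alpha_t = 1 - t, sigma_t = t), so the velocity output can be followed directly; velocity_fn stands in for the trained model.

```python
import torch

def euler_ode_sample(velocity_fn, shape, num_steps=10, device="cpu"):
    """Deterministic Euler integration of a flow-matching ODE from t=1 (noise)
    down to t=0 (data). velocity_fn(x, t) predicts v = epsilon - x_0."""
    x = torch.randn(shape, device=device)  # start from pure noise at t = 1
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        v = velocity_fn(x, t)  # one model call per step
        x = x - dt * v         # step backward in t, toward the data
    return x
```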

Conclusion: Editor's Outlook
This paper goes beyond mere technical optimization to provide an important starting point for the democratization of generative AI.
First, a dramatic reduction in fine-tuning costs. Previously, fine-tuning diffusion models to human intent required massive computing resources. However, using the method proposed in this paper, models can be personalized quickly with relatively few resources. This opens the door for small and medium-sized businesses and individual researchers to have their own high-quality generation models.
Second, a shift from 'data-centric' to 'evaluation-centric'. Until now, the focus has been on collecting tens of thousands of high-quality datasets to create good images. However, once this technology is commercialized, designing a good reward function that judges "what makes a good image?" will become more important than collecting datasets.
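As a toy illustration of "evaluation-centric" design: a reward is just a function from images to scalars. The brightness target below is purely illustrative; a production reward would wrap a learned scorer (CLIP similarity, an aesthetic model, GenEval-style object checks).

```python
import torch

def toy_reward(images, target_mean=0.5):
    """Hypothetical reward: score each image by how close its mean intensity
    is to a target. The RL loop only ever sees these scalar scores, so the
    quality of this function determines what the model learns to optimize."""
    means = images.mean(dim=[1, 2, 3])
    return -(means - target_mean) ** 2
```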
In conclusion, the insight of this paper - "Don't change the textbook (Loss), change the exam method (Likelihood Estimation)" - is very likely to become the standard in the vision AI field going forward. This is a must-read study for engineers thinking about efficient AI modeling.

References
- Choi et al., "Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design", arXiv:2602.04663, 2026.