Off-Policy Reinforcement Learning (RL) with KL Divergence Yields Superior Reasoning in Large Language Models

VedVision HeadLines June 2, 2025

Policy gradient methods have significantly advanced the reasoning capabilities of LLMs, particularly through RL. A key tool for stabilizing these methods is Kullback-Leibler (KL) regularization, which discourages drastic changes between the current policy and the reference policy. While widely used in algorithms such as PPO, there is still much to explore in how different KL variants, such as Forward KL, Reverse KL, and their unnormalized forms, can be estimated and applied within loss functions. These choices, along with the choice of gradient estimator and of on-policy versus off-policy training, shape training stability and performance in nuanced and underexplored ways.
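
To make the distinction concrete, the sketch below shows common sample-based estimates of the reverse and forward KL penalties computed from per-token log-probabilities under the current and reference policies. It assumes PyTorch, and tensor names such as logp_cur and logp_ref are illustrative; it is a minimal example of the kind of estimators being compared, not the paper's exact formulation.

import torch

def kl_estimates(logp_cur, logp_ref):
    # Per-token log-ratio: log pi_theta(a|s) - log pi_ref(a|s), for tokens sampled from pi_theta.
    log_ratio = logp_cur - logp_ref
    r_ref = torch.exp(-log_ratio)                      # pi_ref / pi_theta
    # Reverse KL(pi_theta || pi_ref): low-variance "k3"-style estimate under pi_theta samples.
    reverse_kl = r_ref - 1.0 + log_ratio
    # Forward KL(pi_ref || pi_theta): importance-weighted estimate under the same samples.
    forward_kl = r_ref * (-log_ratio) - (r_ref - 1.0)
    return reverse_kl.mean(), forward_kl.mean()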

Fine-tuning LLMs with human feedback is crucial for building aligned AI systems. Two main strategies are employed: optimizing with reward models using policy gradient methods, such as PPO, and directly training on human preferences through methods like Direct Preference Optimization (DPO). While PPO stabilizes training with reward models, DPO and its variants use pairwise comparisons to simplify and scale learning, gaining popularity in recent models. Reinforcement learning is also increasingly used to enhance LLM reasoning, especially in complex tasks like math and coding. New methods aim to reduce computational costs and improve training stability, often by replacing value networks or modifying KL penalties. 
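
As a point of reference for the preference-based route, the following is a minimal sketch of the standard DPO pairwise loss (PyTorch assumed; the sequence-level log-probability inputs and tensor names are illustrative). It requires only a frozen reference model and chosen/rejected response pairs rather than an explicit reward model.

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards: beta times the policy-vs-reference log-ratio of each response.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()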

Researchers from UCLA, Tsinghua University, and the Shanghai Qi Zhi Institute introduce Regularized Policy Gradient (RPG), a unified framework for KL-regularized policy gradients in online reinforcement learning. They derive policy gradients and surrogate loss functions using both Forward and Reverse KL divergences, covering normalized and unnormalized policies. RPG supports both fully differentiable objectives and REINFORCE-style estimators, tailored for off-policy training with importance sampling. The study also identifies and addresses theoretical issues in existing methods, such as GRPO, and examines how KL regularization is applied in REINFORCE++. Experiments on LLM reasoning tasks demonstrate that RPG achieves improved stability and performance compared to leading baselines, including GRPO, REINFORCE++, and DAPO.
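
The paper derives its surrogate losses analytically; as a rough illustration of what a fully differentiable, importance-sampled, reverse-KL-regularized objective can look like (PyTorch assumed; function and tensor names and the kl_coef value are illustrative, not the paper's):

import torch

def differentiable_surrogate(logp_cur, logp_old, logp_ref, rewards, kl_coef=0.05):
    # Importance weight pi_theta / pi_old; gradients flow through logp_cur.
    iw = torch.exp(logp_cur - logp_old.detach())
    reward_term = iw * rewards
    # Importance-sampled estimate of KL(pi_theta || pi_ref) under samples from pi_old.
    kl_term = iw * (logp_cur - logp_ref.detach())
    # Minimize the negative of (expected reward - KL penalty).
    return -(reward_term - kl_coef * kl_term).mean()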

The study presents policy gradient methods that incorporate KL divergence regularization in both online and off-policy settings using importance sampling from an older policy. For forward KL, the gradient involves importance-weighted rewards and a regularization term, with its loss resembling the maximum likelihood loss when the rewards are zero. The unnormalized forward KL adds a correction for mismatched distribution masses. Similarly, reverse KL and its unnormalized form penalize deviation from the reference policy, modifying the reward based on log-probability ratios. All approaches share a REINFORCE-like gradient structure, enabling alternative implementations using the stop-gradient operator, which supports stable and efficient optimization in practice. 
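
A REINFORCE-style counterpart, using detached (stop-gradient) weights so that gradients flow only through the current log-probabilities, might be sketched as follows (again PyTorch with illustrative names; the exact correction terms for each KL variant are given in the paper):

import torch

def reinforce_style_surrogate(logp_cur, logp_old, logp_ref, rewards, kl_coef=0.05):
    with torch.no_grad():
        iw = torch.exp(logp_cur - logp_old)         # importance weight pi_theta / pi_old, treated as a constant
        kl_pen = logp_cur - logp_ref                # per-token reverse-KL-style penalty
        shaped_reward = rewards - kl_coef * kl_pen  # reward modified by the log-probability ratio
    # REINFORCE-like surrogate: only logp_cur carries gradient (the stop-gradient trick).
    return -(iw * shaped_reward * logp_cur).mean()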

The researchers conducted a thorough evaluation of their proposed RPG methods—both differentiable and REINFORCE-style—by comparing them to several established baselines on complex math reasoning tasks using Qwen2.5 language models. They trained on the DAPO-Math-17k dataset and evaluated performance using benchmarks such as AMC23 and AIME. RPG variants consistently demonstrated strong accuracy, training stability, and efficient memory usage. Implementation utilized the Verl framework and techniques such as KL regularization, PPO-style clipping, and Schedule-Free AdamW for smoother optimization. RPG models generally outperformed others in reward shaping, entropy control, and response length, highlighting their robustness and suitability for stable, high-performance learning. 
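
For context on the PPO-style clipping mentioned above, a generic clipped surrogate over the importance ratio looks roughly like this (PyTorch assumed; the clip range and tensor names are illustrative, not the reported hyperparameters):

import torch

def ppo_clipped_surrogate(logp_cur, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_cur - logp_old)                          # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (clipped) objective per token, then average.
    return -torch.min(unclipped, clipped).mean()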

In conclusion, RPG is a comprehensive framework for designing and analyzing policy gradient methods that incorporate KL regularization in online, off-policy reinforcement learning. The framework explores a range of configurations, including forward and reverse KL divergences, normalized and unnormalized policy distributions, and two types of estimators: fully differentiable and REINFORCE-style. RPG aims to provide a structured approach to understanding and implementing these variations. Applied to reasoning tasks with large language models, the proposed methods demonstrate more stable training and competitive or improved performance compared to established baselines such as GRPO, REINFORCE++, and DAPO.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


