Vector Policy Optimization: The Key to Diverse Language Model Outputs
Vector Policy Optimization promises to enhance language models by fostering diverse solutions, challenging the current low-entropy output paradigm.
Language models are increasingly expected to adapt to novel environments and to run inside complex inference-scaling search procedures such as AlphaEvolve. These procedures select rollouts using many task-specific reward functions. However, the current training paradigm, which optimizes a single pre-specified scalar reward, limits a model's ability to generate the diversity these searches require.
The Problem with Scalar Rewards
Traditional post-training of large language models (LLMs) often results in low-entropy response distributions: the model concentrates most of its probability on a few similar answers. Such models may handle straightforward tasks adequately, but they falter when asked to generate the diverse solutions that inference-time search depends on. Why is this a significant issue? Simply put, as demand grows for more complex and varied outputs, relying on scalar reward optimization becomes a shortcoming we can no longer afford.
Enter Vector Policy Optimization
Vector Policy Optimization (VPO) is a reinforcement learning (RL) algorithm that seeks to resolve this problem. Unlike scalar-reward methods, VPO trains policies to anticipate a range of downstream reward functions, producing outputs that are both diverse and specialized to different trade-offs within the vector reward space. This isn't just a tweak; it's a paradigm shift.
VPO builds on vector-valued rewards, which are common in practice. In code generation, for example, correctness is evaluated per test case. Similarly, user interactions can be assessed with multiple personas or models. VPO is a drop-in replacement for the advantage estimator in GRPO (Group Relative Policy Optimization), with the added benefit of producing diverse solutions optimized for different rewards.
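To make the contrast concrete, here is a minimal Python sketch of the two ingredients mentioned above: per-test-case reward vectors and a GRPO-style advantage that standardizes scalar rewards within a group of rollouts. The article does not spell out VPO's actual estimator, so the weight-sampling step is purely an illustrative assumption and not the authors' method; the helper names (`grpo_advantages`, `scalarize`, `sample_simplex_weights`) and the toy data are hypothetical.

```python
import random
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize scalar rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scalarize(reward_vectors, weights):
    """Collapse each reward vector to a scalar with a fixed weight vector."""
    return [sum(w * r for w, r in zip(weights, vec)) for vec in reward_vectors]


def sample_simplex_weights(dim, rng):
    """Draw weights uniformly from the probability simplex (normalized exponentials)."""
    raw = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(raw)
    return [x / total for x in raw]


# Toy data: four rollouts for one coding prompt, each scored on three test cases (1 = pass).
reward_vectors = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]

# Scalar baseline: average the test cases, then normalize within the group.
uniform = [1 / 3] * 3
scalar_adv = grpo_advantages(scalarize(reward_vectors, uniform))

# ASSUMPTION (not from the article): one simple way to expose different trade-offs
# is to scalarize with randomly sampled weights instead of a fixed average.
rng = random.Random(0)
weights = sample_simplex_weights(3, rng)
weighted_adv = grpo_advantages(scalarize(reward_vectors, weights))

print("uniform-weight advantages:", scalar_adv)
print("sampled weights:", weights)
print("sampled-weight advantages:", weighted_adv)
```

Under uniform weights, the rollout that passes only the third test is scored the same as the one that passes only the first, even though the two solve different parts of the problem. A vector-aware method can keep that distinction, which is the kind of diversity the article argues search procedures need.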
Breaking New Ground
In trials across four tasks, VPO consistently matched or outperformed the strongest scalar RL baselines on test-time search metrics such as pass@k and best@k. As the search budget increases, the performance gap widens further in VPO's favor. In evolutionary search, VPO models solved problems that GRPO models could not.
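For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed from n sampled rollouts. The article does not say which estimators were used, so this is an assumption: `pass_at_k` is the widely used unbiased estimator 1 - C(n-c, k)/C(n, k), and `best_at_k` is taken here to mean the expected maximum score among k rollouts drawn without replacement. Both function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k rollouts, drawn without replacement
    from n samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_at_k(scores, k):
    """Expected maximum score among k rollouts drawn without replacement
    (an assumed definition of best@k)."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= len(scores)")
    ordered = sorted(scores)
    # The element at 0-based index i is the maximum of exactly comb(i, k - 1) subsets of size k.
    weighted = sum(ordered[i] * comb(i, k - 1) for i in range(k - 1, n))
    return weighted / comb(n, k)


# Example: 10 rollouts, 3 of them correct.
for k in (1, 5, 10):
    print(k, pass_at_k(10, 3, k))
```

Both quantities rise with k only if the sampled rollouts differ from one another, which is why a policy that produces near-identical answers gains little from a larger search budget.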
As test-time search becomes increasingly standard, it raises a pertinent question: should optimizing for diversity become the default post-training objective for language models? The reported results suggest yes. Training and test-time search are converging, and if we want our models to thrive in that convergence, embracing diversity will be non-negotiable.
The Road Ahead
At a time when expectations of language models are rising quickly, Vector Policy Optimization offers a promising path forward. It challenges the status quo by advocating for diversity and adaptability, both necessary for navigating the complexities of modern AI tasks. If we're serious about evolving language models to meet future demands, VPO isn't just an option; it's a necessity.