arXiv: 2505.05465

ComPO: Preference Alignment via Comparison Oracles

Published:  at  11:09 AM

This paper introduces ComPO, a novel preference alignment method for LLMs that uses comparison oracles to make effective use of noisy preference pairs, demonstrating reduced verbosity and mitigated likelihood displacement across multiple models and benchmarks.

Large Language Model, Alignment, Reinforcement Learning, Instruction Tuning, Efficiency

Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin

Columbia University, Stern School of Business, New York University, DAMO Academy, Alibaba Group US

Generated by grok-3

Background Problem

The paper addresses the alignment of large language models (LLMs) with human preferences, focusing on the limitations of direct alignment methods like Direct Preference Optimization (DPO). These methods, while simpler than Reinforcement Learning from Human Feedback (RLHF), suffer from verbosity (generating unnecessarily long responses) and likelihood displacement (unintentionally reducing the probability of preferred responses). These issues are exacerbated by noisy preference pairs—data where preferred and dispreferred responses have similar likelihoods—leading to inefficient resource use and potential safety risks (e.g., generating harmful content). The work aims to mitigate these problems by proposing a new alignment method that effectively utilizes noisy preference data.

Method

The proposed method, termed Comparison-Based Preference Optimization (ComPO), leverages comparison oracles to align LLMs with human preferences, with a particular focus on noisy preference pairs. The core idea is to treat preference pairs as comparative judgments about an implicit alignment objective, avoiding reliance on a specific proxy objective such as DPO’s log-likelihood margin. The basic scheme (Algorithm 1) uses a zeroth-order optimization approach, estimating gradients via comparison oracles that assess whether one set of model parameters is better than another based on the likelihoods of the preferred and dispreferred responses. Each iteration perturbs the parameters, queries the oracle, and updates the parameters accordingly, and the scheme comes with a convergence guarantee in non-convex settings. The practical scheme (Algorithm 2) improves efficiency by restricting perturbations to the output layer weights, approximating sparse gradient estimation with clipping, and adjusting step sizes based on oracle feedback. The final method combines DPO on clean data with ComPO on noisy data, using a log-likelihood margin threshold to classify pairs.
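
To make the basic scheme concrete, here is a minimal sketch of a comparison-oracle-driven zeroth-order update in the spirit of Algorithm 1. The toy log-likelihood, the oracle rule, and all names (`log_likelihood`, `comparison_oracle`, `compo_step`) are illustrative assumptions rather than the authors' implementation; in the paper the oracle compares full model parameter sets through the likelihoods the LLM assigns to the preferred and dispreferred responses, and the practical scheme perturbs only the output layer.

```python
# Hedged sketch of a comparison-oracle zeroth-order update (not the paper's code).
import torch

def log_likelihood(theta, response):
    # Stand-in for the model's log-likelihood of `response` under parameters `theta`.
    # A toy quadratic is used here so the sketch runs end to end.
    return -0.5 * torch.sum((theta - response) ** 2)

def comparison_oracle(theta_a, theta_b, preferred, dispreferred):
    """Return +1 if theta_a looks better aligned than theta_b, else -1.

    'Better' is judged only through likelihoods: theta_a wins if it raises the
    likelihood of the preferred response and lowers that of the dispreferred one.
    """
    better_pref = bool(log_likelihood(theta_a, preferred) > log_likelihood(theta_b, preferred))
    worse_disp = bool(log_likelihood(theta_a, dispreferred) < log_likelihood(theta_b, dispreferred))
    return 1 if (better_pref and worse_disp) else -1

def compo_step(theta, preferred, dispreferred, num_queries=32, mu=1e-2, lr=1e-2):
    """One zeroth-order update: perturb, query the oracle, aggregate signed directions."""
    grad_est = torch.zeros_like(theta)
    for _ in range(num_queries):
        u = torch.randn_like(theta)                 # random perturbation direction
        sign = comparison_oracle(theta + mu * u, theta, preferred, dispreferred)
        grad_est += sign * u                        # keep directions the oracle favors
    grad_est /= num_queries
    return theta + lr * grad_est

# Toy usage: parameters drift toward the "preferred" target and away from the other.
theta = torch.zeros(8)
preferred, dispreferred = torch.ones(8), -torch.ones(8)
for _ in range(200):
    theta = compo_step(theta, preferred, dispreferred)
print(theta)
```

The key design point this sketch tries to convey is that no gradient of a proxy objective is ever computed; the update direction is assembled purely from one-bit oracle answers about which perturbed parameter set looks better.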

Experiment

The experiments evaluate ComPO’s effectiveness on base and instruction-tuned models (Mistral-7B, Llama-3-8B, Gemma-2-9B) using the AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks. The setup splits the preference data into clean and noisy subsets using a log-likelihood margin threshold (δ=3) and compares ComPO-augmented DPO and SimPO against the corresponding baselines. Results show ComPO improves length-controlled win rates (LC), indicating reduced verbosity, with notable gains on AlpacaEval 2 (e.g., 35.79% LC for Llama-3-Instruct-8B with DPOclean+ComPO vs. 32.59% for DPO). It also mitigates likelihood displacement, increasing the likelihood of preferred responses while decreasing that of dispreferred responses, as evidenced by the log-likelihood comparisons. However, performance on Arena-Hard is inconsistent, likely because that benchmark applies no length penalty and therefore favors longer responses. The experimental design is reasonable for demonstrating ComPO’s potential but is limited by a simplistic metric for identifying noisy pairs (log-likelihood margin rather than the embedding-based CHES score) and by small noisy-pair sample sizes (e.g., 100 pairs). Computational efficiency is achieved by updating only the output layer parameters, requiring modest GPU memory (30GB for Gemma-2-9B), though scalability to larger models remains untested.
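
As a small illustration of the clean/noisy split described above, the following sketch partitions preference pairs by their log-likelihood margin under a reference model, routing high-margin pairs to DPO and the rest to ComPO. The `PreferencePair` container, the `ref_logprob` callable, and the dummy scorer are hypothetical stand-ins, not the paper's code.

```python
# Hedged sketch of splitting preference data by log-likelihood margin (delta = 3
# in the paper's setup); scoring functions here are placeholders.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def split_by_margin(
    pairs: List[PreferencePair],
    ref_logprob: Callable[[str, str], float],
    delta: float = 3.0,
) -> Tuple[List[PreferencePair], List[PreferencePair]]:
    """Partition pairs by the reference-model margin of chosen over rejected."""
    clean, noisy = [], []
    for p in pairs:
        margin = ref_logprob(p.prompt, p.chosen) - ref_logprob(p.prompt, p.rejected)
        (clean if margin >= delta else noisy).append(p)
    return clean, noisy

def dummy_logprob(prompt: str, response: str) -> float:
    # Response length as a crude stand-in for a reference-model log-probability.
    return float(len(response))

# Toy usage: one high-margin (clean) pair and one near-tie (noisy) pair.
pairs = [PreferencePair("q", "a longer chosen answer", "short"),
         PreferencePair("q", "tie", "tye")]
clean, noisy = split_by_margin(pairs, dummy_logprob)
print(len(clean), len(noisy))  # -> 1 1
```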

Further Thoughts

The concept of using comparison oracles in ComPO opens up interesting avenues for alignment research, particularly in handling noisy data, which is often abundant in real-world preference datasets. However, the reliance on a log-likelihood margin for identifying noisy pairs seems rudimentary compared to embedding-based metrics like CHES, as noted in the paper. Future work could explore hybrid metrics combining likelihood and embedding similarities to better capture nuanced differences in responses. Additionally, the focus on output layer perturbations, while computationally efficient, might miss deeper alignment issues in earlier layers of LLMs—could methods like parameter-efficient fine-tuning (e.g., LoRA) be integrated with ComPO to balance efficiency and depth? Another thought is the potential application of ComPO to other generative AI domains, such as diffusion models for image generation, where noisy preference data might also pose challenges. Finally, the inconsistent performance on Arena-Hard suggests a need for benchmark designs that better account for response conciseness, perhaps inspiring a broader discussion in the community about evaluation metrics for aligned models.


