HAIR introduces a novel LLM alignment method combining hardness-aware inverse reinforcement learning with introspective reasoning: it constructs a balanced safety dataset, trains category-specific shadow reward models via IRL, and aligns the policy with hardness-aware GRPO-S, achieving state-of-the-art harmlessness while preserving usefulness across multiple benchmarks.
Large Language Model, Alignment, Reinforcement Learning, Safety, Reasoning, Dataset
Ruoxi Cheng, Haoxuan Ma, Weixin Wang
Alibaba Group, Southeast University, Duke University
Generated by grok-3
Background Problem
The alignment of large language models (LLMs) with human values is a critical challenge due to four key issues: (1) scarcity of balanced safety datasets, which are expensive to annotate and often fail to capture the complexity of human values; (2) alignment tax, where enhancing safety compromises overall model performance; (3) shallow alignment, making models vulnerable to jailbreak attacks; and (4) inability to adapt rewards dynamically based on task difficulty, leading to overfitting on easy examples and underfitting on hard ones. This work aims to address these limitations by developing a novel alignment method that leverages introspective reasoning and hardness-aware optimization to improve safety while maintaining usefulness.
Method
HAIR (Hardness-Aware Inverse Reinforcement Learning with Introspective Reasoning) is a novel LLM alignment approach with two main components:
- Balanced Safety Dataset Construction: A dataset covering seven harmful categories (e.g., Insult, Physical Harm) is created using structured Chain-of-Draft (CoD) prompts to elicit introspective reasoning from LLMs, generating refusal responses with detailed reasoning steps for each harmful query (see the first sketch after this list).
- Shadow Reward Models and Hardness-Aware Optimization: Category-specific shadow reward models are trained using Inverse Reinforcement Learning (IRL) on the curated dataset, following a bilevel optimization framework to learn reward functions. Data hardness (measured via CLIP similarity between generated and reference responses) and model responsiveness (measured via reward gaps) are combined into a hardness coefficient that dynamically adjusts training weights. Finally, Group Relative Policy Optimization-Scaling (GRPO-S), an adaptation of GRPO, aligns the LLM policy for each category, incorporating hardness-aware advantages to balance safety and utility during optimization (see the second sketch after this list).
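To make the dataset-construction step concrete, below is a minimal sketch of how CoD-style refusal demonstrations could be generated per harm category. The template wording, the `generate` callback, and the two listed category names are illustrative assumptions; the paper's exact prompts and full seven-category list are not reproduced here.

```python
from typing import Callable, Dict, List

# Two of the paper's seven harm categories; the remaining names are omitted here.
HARM_CATEGORIES = ["Insult", "Physical Harm"]

# Hypothetical Chain-of-Draft style template: a few short reasoning drafts, then a refusal.
COD_TEMPLATE = (
    'You are asked: "{query}"\n'
    "In at most three short drafts, reason about why this request is unsafe "
    "(category: {category}), then write a brief, polite refusal.\n"
    "Draft 1:"
)

def build_refusal_pairs(
    queries_by_category: Dict[str, List[str]],
    generate: Callable[[str], str],  # any LLM completion function (model choice left open)
) -> List[Dict[str, str]]:
    """Produce (query, introspective refusal) demonstration pairs for each category."""
    pairs = []
    for category, queries in queries_by_category.items():
        for query in queries:
            prompt = COD_TEMPLATE.format(query=query, category=category)
            response = generate(prompt)  # reasoning drafts followed by the refusal
            pairs.append({"category": category, "query": query, "response": response})
    return pairs
```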
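And here is a compact sketch of the hardness-aware weighting and the GRPO-S advantage scaling described above. The linear mixing rule in `hardness_coefficient` and the exact way the coefficient multiplies the group-relative advantage are assumptions made for illustration; the paper's precise formulas may differ.

```python
import numpy as np

def hardness_coefficient(similarity: float, reward_gap: float, alpha: float = 0.5) -> float:
    """Combine data hardness and model responsiveness into one weight.

    similarity -- CLIP-style similarity between a generated response and the
                  reference refusal (high similarity => easy example).
    reward_gap -- gap between the shadow reward of the reference response and
                  of the generated one (large gap => model not yet responsive).
    alpha      -- mixing weight; the combination rule here is an assumption.
    """
    data_hardness = 1.0 - similarity  # harder when far from the reference
    return alpha * data_hardness + (1.0 - alpha) * reward_gap

def grpo_s_advantages(rewards: np.ndarray, hardness: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages scaled by per-sample hardness (GRPO-S sketch).

    rewards  -- shadow-reward scores for a group of responses to one query.
    hardness -- hardness coefficients for the same responses.
    """
    centered = (rewards - rewards.mean()) / (rewards.std() + eps)  # plain GRPO advantage
    return hardness * centered                                     # hardness-aware scaling

# Toy usage: four sampled responses to one harmful query.
rewards = np.array([0.2, 0.7, 0.9, 0.4])
hardness = np.array([hardness_coefficient(s, g) for s, g in
                     [(0.9, 0.1), (0.6, 0.3), (0.4, 0.5), (0.8, 0.2)]])
print(grpo_s_advantages(rewards, hardness))
```

In a full pipeline, each harm category would have its own shadow reward model supplying `rewards`, and the scaled advantages would stand in for the standard GRPO advantages in the policy update.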
Experiment
The experiments were conducted on two open-source LLMs (Qwen-2-7B and Llama-3.1-8B) using a balanced safety dataset of over 70,000 query-response pairs across seven harm categories, sourced from Do-not-answer and Safety-Prompts datasets. Shadow reward models were trained for each category, and GRPO-S was applied for policy alignment. Evaluation spanned four harmlessness benchmarks (StrongReject, XsTest, WildChat, Stereotype) and four helpfulness benchmarks (SimpleQA, AdvGLUE, GSM8k, AlpacaEval), comparing HAIR against eight baselines (e.g., SFT, PPO, DPO, GRPO, STAIR). Results showed HAIR achieving state-of-the-art harmlessness scores (e.g., 0.9055 on StrongReject for Llama-3.1-8B, surpassing STAIR by 2.6 points) with refusal rates above 97% on key safety tests, while maintaining competitive helpfulness (e.g., within 1-3 points of top utility baselines on AlpacaEval). The setup appears comprehensive, covering diverse metrics and model sizes (including 3B variants), and results match the expectation of improved safety with minimal alignment tax. However, the slight utility drop in some benchmarks (e.g., AlpacaEval for Llama) suggests the balance isn’t fully optimized, and reliance on synthetic data raises concerns about real-world generalizability. Ablation studies confirmed the importance of hardness coefficients, with their removal degrading performance by up to 8 points on safety metrics.
Further Thoughts
HAIR’s approach to category-specific reward models opens up interesting avenues for personalized or context-specific alignment, but it also prompts a deeper question about scalability: could this method handle dynamic or user-defined harm categories without prohibitive computational cost? The hardness-aware mechanism is a standout feature, yet its reliance on CLIP similarity for data hardness might not fully capture semantic nuance; alternative metrics such as BERTScore, or human-in-the-loop feedback, could make it more robust. Additionally, the synthetic nature of the safety dataset, while innovative, risks embedding model-specific biases (e.g., from GPT-3.5-turbo); cross-referencing with human-annotated datasets or integrating real-world user interactions could mitigate this. I’m also curious about how HAIR intersects with other alignment techniques like RLHF or DPO: could a hybrid approach further reduce the alignment tax by leveraging preference data alongside demonstration data? Finally, the ethical implications of automated reward shaping, as noted in the paper, warrant further exploration, especially in culturally diverse contexts where definitions of harm vary widely; adaptive frameworks that incorporate global perspectives on safety and fairness would help address this.