Instruction Following Decorator: A Framework for Robust and Sample-Efficient RLVR Training

Xu Guo1,2,3*, Tianyi Liang1,2,4*, Tong Jian1, Xiaogui Yang1,
Ling-I Wu1, Chenhui Li4, Zhihui Lu3†, Qipeng Guo1†, Kai Chen1†
1Shanghai AI Laboratory, 2Shanghai Innovation Institute
3Fudan University, 4East China Normal University
*Equal Contribution. Corresponding authors.
{tongjian, yangxiaogui, wulianyi, guoqipeng, chenkai}@pjlab.org.cn,
guox24@m.fudan.edu.cn, tyliang@stu.ecnu.edu.cn, chli@cs.ecnu.edu.cn, lzh@fudan.edu.cn

IFD Framework Overview

The IFD framework consists of three synergistic components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, (2) IntentCheck, a bypass module enforcing intent alignment, and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFD), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFD achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates.

Core Concepts

Difficulty vs Complexity

Our framework distinguishes between two fundamental aspects of instruction challenge: difficulty and complexity. Understanding this distinction is crucial for designing effective instruction following training.

Difficulty

Definition: The actual challenge level of completing an instruction successfully.

Measurement: Based on model pass rates when evaluated under verification criteria.

Categories:

  • Very Hard: Pass rate (0, 0.125]
  • Hard: Pass rate (0.125, 0.25]
  • Medium: Pass rate (0.25, 0.375]
  • Easy: Pass rate (0.375, 0.5]

Complexity

Definition: The structural complexity, determined by the number of constraints.

Measurement: Count of constraints within the instruction.

Levels:

  • L0–L2: ≤ 2 constraints (Simple)
  • L3–L4: 3-4 constraints (Medium)
  • L5+: ≥ 5 constraints (Complex)

🔍 Key Insight: Complexity and difficulty are weakly correlated dimensions. A simple instruction with few constraints can be genuinely difficult to execute, while a complex instruction with many constraints might be straightforward to fulfill. Our cooperative-adversarial data flywheel leverages this distinction to generate appropriately challenging training data across both dimensions.

IntentCheck: Bypassing Complex Verification

Traditional hybrid verification \(V_H(I,R)\) combines rule-based scripts and LLM-based criteria, but acts as a proxy metric that can be exploited under optimization pressure. IntentCheck addresses this by focusing on the actual intent behind instructions rather than surface-level constraints.

\[ V_H(I,R) = \left( \bigwedge_{s \in \text{Scripts}} s(I,R) \right) \land \left( \bigwedge_{c \in \text{Criteria}} c(I,R) \right) \]
\[ R_{final}(I,R) = V_T(I,R) \land V_H(I,R) \]

IntentCheck \(V_T(I,R)\) uses large language models to extract instruction intent and directly judges whether responses fulfill that intent, employing a strict binary strategy (1 for complete success, 0 for any failure) to ensure robust intent alignment.

Trip Wires: Detecting Reward Hacking

Trip wires function like "sting operations" - they contain deliberate exploit patterns designed to lure models into exploiting verification shortcuts. Based on observed behaviors, we design seemingly exploitable instructions that tempt models to take shortcuts, then capture these hacking behaviors using predefined detection patterns.

\[ \text{MHR}(\pi_\theta) = \frac{1}{|T|} \sum_{I \in T} \mathbf{1}\left[\bigvee_{d \in D_I} d(I, R_{\pi_\theta}(I))\right] \]

The Macro Hack Rate (MHR) quantifies the fraction of trip wire instructions where the model exhibits exploitative behaviors. Crucially, trip wires operate independently of training and remain invisible to the policy, ensuring they don't interfere with reward computations.

As Goodhart's law states, "when a metric is used as a target, it ceases to be a good metric." This principle directly applies to our context: when verification metrics become optimization targets in RLVR training, models learn to exploit these metrics rather than genuinely following instructions.

Reward Hacking Cases

Examples of how models exploit verifications without following instructions

Key Results

Performance Comparison

Main Results Table

Our IFD framework significantly improves instruction following ability across diverse configurations. Qwen2.5-32B-Instruct-IFD achieves 87.43% on IFEval and 66.76% on FollowBench, outperforming larger models while preserving general capabilities.

Instruction Following vs Hack Resistance

IF vs Hack Resistance

The relationship between instruction following performance (IF-Eval) and hack resistance (1 - Macro Hack Rate). IFD guides LLMs toward the upper-right region, achieving strong instruction following with robust hack resistance.

Dataset Statistics

Trip Wires

Distribution of difficulty and complexity levels in our synthetic dataset. Difficulty levels are based on model pass rates: Very Hard (0, 0.125], Hard (0.125, 0.25], Medium (0.25, 0.375], and Easy (0.375, 0.5]. Complexity levels are categorized by constraint count: L0–L2 (≤2 constraints), L3–L4 (3-4 constraints), and L5+ (≥5 constraints). The distribution shows balanced representation across both dimensions.

Ablation Studies

Training Components Analysis

Training Ablation

📊 Experimental Setup: We conduct ablation studies to validate the effectiveness of each training component based on the complete IFD configuration. Four key configurations are examined: (1) w/o Filtering Non-Inst - removing filtering of instructions related to math, logic, and code; (2) w/o Filtering Too Hard - disabling difficulty selection in the data flywheel; (3) w/o Strict Reward - using a single question instead of a checklist; (4) w/ KL regularization - setting KL coefficient to 0.005.
💡 Key Insights:Filtering mechanisms are critical - removing non-instruction filtering degrades both instruction following and general abilities, as models learn incorrect signals from tasks requiring domain-specific knowledge. ② Difficulty filtering is essential - it significantly impacts instruction following performance while having minimal effect on general ability, since overly difficult instructions fail to generate meaningful positive-negative response pairs. ③ Strict reward (checklist) is crucial - it significantly improves both metrics by providing more accurate supervision signals compared to single questions. ④ KL regularization is detrimental - though the impact is minor in our setting.

IntentCheck Effectiveness

IntentCheck Ablation

📊 Experimental Setup: We analyze the evolution of Macro Hack Rate (MHR) across six training configurations to evaluate IntentCheck's effectiveness. The default setup employs Qwen2.5-32B-Instruct for supervision (32B as IntentCheck, w/o KL). The baseline configuration (w/o IntentCheck, w/o Criteria, w/o KL) represents naive RLVR training using only script-based verification. We also test different model sizes (7B vs 32B) for IntentCheck supervision.
💡 Key Insights:IntentCheck is essential for hack mitigation - without it, LLMs exhibit high MHR even with KL regularization, demonstrating that KL regularization alone cannot effectively prevent reward hacking. ② Model size matters but shows diminishing returns - using a 7B model as IntentCheck still significantly reduces reward hacking, while a larger 32B model achieves only marginally better mitigation, indicating the robustness of our approach. ③ Naive RLVR suffers severe hacking - reaching a maximum MHR of 0.6574, aligning with prior findings on reward over-optimization. ④ LLM-based criteria are more robust - removing criteria results in slight MHR increase, suggesting that LLM-based verification is harder to exploit than pure script-based verification.

BibTeX

@misc{guo2025ifdecoratorwrappinginstructionfollowing,
      title={IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards}, 
      author={Xu Guo and Tianyi Liang and Tong Jian and Xiaogui Yang and Ling-I Wu and Chenhui Li and Zhihui Lu and Qipeng Guo and Kai Chen},
      year={2025},
      eprint={2508.04632},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.04632}, 
}