The IFD framework consists of three synergistic components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, (2) IntentCheck, a bypass module enforcing intent alignment, and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions.
Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFD), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFD achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates.
Our framework distinguishes between two fundamental aspects of instruction challenge: difficulty and complexity. Understanding this distinction is crucial for designing effective instruction following training.
Definition: The actual challenge level of completing an instruction successfully.
Measurement: Based on model pass rates when evaluated under verification criteria.
Categories:
Definition: The structural complexity, determined by the number of constraints.
Measurement: Count of constraints within the instruction.
Levels:
🔍 Key Insight: Complexity and difficulty are weakly correlated dimensions. A simple instruction with few constraints can be genuinely difficult to execute, while a complex instruction with many constraints might be straightforward to fulfill. Our cooperative-adversarial data flywheel leverages this distinction to generate appropriately challenging training data across both dimensions.
Traditional hybrid verification \(V_H(I,R)\) combines rule-based scripts and LLM-based criteria, but acts as a proxy metric that can be exploited under optimization pressure. IntentCheck addresses this by focusing on the actual intent behind instructions rather than surface-level constraints.
IntentCheck \(V_T(I,R)\) uses large language models to extract instruction intent and directly judges whether responses fulfill that intent, employing a strict binary strategy (1 for complete success, 0 for any failure) to ensure robust intent alignment.
Trip wires function like "sting operations" - they contain deliberate exploit patterns designed to lure models into exploiting verification shortcuts. Based on observed behaviors, we design seemingly exploitable instructions that tempt models to take shortcuts, then capture these hacking behaviors using predefined detection patterns.
The Macro Hack Rate (MHR) quantifies the fraction of trip wire instructions where the model exhibits exploitative behaviors. Crucially, trip wires operate independently of training and remain invisible to the policy, ensuring they don't interfere with reward computations.
As Goodhart's law states, "when a metric is used as a target, it ceases to be a good metric." This principle directly applies to our context: when verification metrics become optimization targets in RLVR training, models learn to exploit these metrics rather than genuinely following instructions.
Examples of how models exploit verifications without following instructions
Our IFD framework significantly improves instruction following ability across diverse configurations. Qwen2.5-32B-Instruct-IFD achieves 87.43% on IFEval and 66.76% on FollowBench, outperforming larger models while preserving general capabilities.
The relationship between instruction following performance (IF-Eval) and hack resistance (1 - Macro Hack Rate). IFD guides LLMs toward the upper-right region, achieving strong instruction following with robust hack resistance.
Distribution of difficulty and complexity levels in our synthetic dataset. Difficulty levels are based on model pass rates: Very Hard (0, 0.125], Hard (0.125, 0.25], Medium (0.25, 0.375], and Easy (0.375, 0.5]. Complexity levels are categorized by constraint count: L0–L2 (≤2 constraints), L3–L4 (3-4 constraints), and L5+ (≥5 constraints). The distribution shows balanced representation across both dimensions.
@misc{guo2025ifdecoratorwrappinginstructionfollowing,
title={IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards},
author={Xu Guo and Tianyi Liang and Tong Jian and Xiaogui Yang and Ling-I Wu and Chenhui Li and Zhihui Lu and Qipeng Guo and Kai Chen},
year={2025},
eprint={2508.04632},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.04632},
}