ICML 2026 Poster Paper

Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation

SpecBench measures whether large language models can follow scenario-specific behavioral goals while staying inside safety boundaries. Align3 improves this specification alignment at test time through lightweight hierarchical reflection and revision.

  1. 1Shanghai AI Laboratory
  2. 2School of Artificial Intelligence, Shanghai Jiao Tong University
  3. 3University of Science and Technology of China
  4. 4The Chinese University of Hong Kong
  5. 5University of Illinois at Urbana-Champaign

Corresponding authors: Yafu Li and Yu Cheng

SpecBench overview showing scenario-level specifications, prompt construction, model responses, and evaluation.
SpecBench overview. Scenario-level specifications define both behavioral requirements and safety boundaries, allowing model responses to be evaluated with fine-grained specification judgments.

Specification Alignment for Real-World Boundaries

Specification alignment asks whether a model can adapt to dynamic, scenario-level rules rather than treating safety and helpfulness as a single fixed policy. The specification may describe domain expertise, style, completeness, user needs, and safety limits that vary across applications.

SpecBench turns this into a benchmark, and Align3 shows that test-time deliberation can improve alignment without retraining the model for every new scenario.

Specification Alignment

Dynamic scenario-level behavioral and safety alignment for large language models.

SpecBench

A unified benchmark spanning five scenarios, 103 specifications, and 1,500 prompts.

Align3

A lightweight test-time deliberation method for reasoning over specification boundaries.

Five Scenarios, Shared Evaluation Logic

SpecBench covers Biochem, Child, Code, Health, and Travel. Each scenario has its own behavioral expectations and safety constraints, reflecting how real applications impose different boundaries even when prompts look superficially similar.

5

Scenarios

103

Specifications

1,500

Prompts

33

Evaluated models

Biochem

Procedural biochemical assistance with dual-use safety boundaries.

Child

Child-oriented storytelling that should remain age-appropriate and safe.

Code

Programming help constrained by vulnerability and misuse requirements.

Health

Personal health education requiring evidence-based and respectful guidance.

Travel

Travel planning aligned with practical preferences and safe recommendations.

Illustration of specification alignment across diverse scenarios and customized specifications.
Scenario-level specification alignment. The same model must adapt to different behavioral goals and safety boundaries depending on the application context.

Safety First, Then Behavioral Alignment

SpecBench evaluates each response against the scenario specifications. Safety requirements decide whether the response crosses a boundary; behavioral requirements measure whether the safe response still satisfies the scenario's intended helpful behavior.

01

Safety Score

Measures whether responses avoid violating scenario-specific safety specifications.

02

Behavioral Score

Measures how well responses satisfy relevant behavioral specifications when judged in context.

03

SAR

Specification Alignment Rate combines safety and behavior, assigning zero score to unsafe responses.

Alignment Gaps and Test-Time Gains

The paper evaluates 18 instruct models and 15 reasoning models. SpecBench reveals substantial remaining alignment gaps: most models score below 65% SAR. Align3 improves Qwen3-14B from 51.03% to 62.92% SAR with minimal token overhead.

33 models. The benchmark covers both open-source and closed-source model families.

Safety-helpfulness trade-off. SAR makes unsafe helpfulness visible as a failure case.

Align3. Hierarchical reflection and revision improves Qwen3-14B without model retraining.

Overall evaluation results from GPT-4.1 and Qwen3-32B-thinking, reporting safety, behavioral, and SAR scores across 33 models.
Model comparison. Safety, behavioral, and SAR scores across 33 evaluated models. Open full figure
Test-time deliberation results showing changes in safety, behavioral, SAR, and token usage.
Test-time deliberation. TTD methods improve specification alignment, with Align3 providing a strong efficiency and alignment trade-off. Open full figure
Overall GPT-4.1 evaluation heatmap averaged over five scenarios.
GPT-4.1 evaluation. Safety, behavioral, and SAR scores averaged over five scenarios. Open full figure

Code, Data, and Quickstart

The public repository includes the generation and evaluation pipeline, scenario data, configuration files, and examples for running external APIs or vLLM-hosted models.

Cite SpecBench

@misc{zhang2025reasoningboundariesenhancingspecification,
      title={Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation},
      author={Haoran Zhang and Yafu Li and Xuyang Hu and Dongrui Liu and Zhilin Wang and Bo Li and Yu Cheng},
      year={2025},
      eprint={2509.14760},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.14760},
}