ICML 2026 Poster Paper

Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation

SpecBench measures whether large language models can follow scenario-specific behavioral goals while staying inside safety boundaries. Align3 improves this specification alignment at test time through lightweight hierarchical reflection and revision.

Haoran Zhang¹,²
Yafu Li¹,⁴,†
Xuyang Hu¹
Dongrui Liu²
Zhilin Wang³
Bo Li⁵
Yu Cheng⁴,†

¹Shanghai AI Laboratory
²School of Artificial Intelligence, Shanghai Jiao Tong University
³University of Science and Technology of China
⁴The Chinese University of Hong Kong
⁵University of Illinois at Urbana-Champaign

†Corresponding authors: Yafu Li and Yu Cheng


Overview

Specification Alignment for Real-World Boundaries

Specification alignment asks whether a model can adapt to dynamic, scenario-level rules rather than treating safety and helpfulness as a single fixed policy. The specification may describe domain expertise, style, completeness, user needs, and safety limits that vary across applications.

SpecBench turns this into a benchmark, and Align3 shows that test-time deliberation can improve alignment without retraining the model for every new scenario.

Specification Alignment

Dynamic scenario-level behavioral and safety alignment for large language models.

SpecBench

A unified benchmark spanning five scenarios, 103 specifications, and 1,500 prompts.

Align3

A lightweight test-time deliberation method for reasoning over specification boundaries.
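Reflection and revision at test time can be pictured as a short loop around an unchanged model. The sketch below is not the Align3 implementation: the `deliberate` function, the `Generate` callable, the prompt wording, and the choice to revise against behavioral specifications first and safety specifications last are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical interface: any function mapping a prompt string to a completion.
Generate = Callable[[str], str]


def deliberate(
    question: str,
    behavioral_specs: Sequence[str],
    safety_specs: Sequence[str],
    generate: Generate,
) -> str:
    """Draft an answer, then reflect and revise against each spec group in turn."""
    draft = generate(question)
    # Assumed ordering: behavioral goals first, safety boundaries last,
    # so the final revision is the one made with the safety rules in view.
    for label, specs in (("behavioral", behavioral_specs), ("safety", safety_specs)):
        spec_text = "\n".join(f"- {s}" for s in specs)
        prompt = (
            f"Question:\n{question}\n\n"
            f"Current answer:\n{draft}\n\n"
            f"{label.capitalize()} specifications:\n{spec_text}\n\n"
            "Reflect on where the answer departs from these specifications, "
            "then output a revised answer."
        )
        draft = generate(prompt)
    return draft
```

Because the model is passed in as a plain callable, the same loop could wrap an external API or a locally hosted model; each call to `deliberate` costs one draft plus one revision per specification group.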

Benchmark

Five Scenarios, Shared Evaluation Logic

SpecBench covers Biochem, Child, Code, Health, and Travel. Each scenario has its own behavioral expectations and safety constraints, reflecting how real applications impose different boundaries even when prompts look superficially similar.

5 Scenarios

103 Specifications

1,500 Prompts

33 Evaluated models

Biochem

Procedural biochemical assistance with dual-use safety boundaries.

Child

Child-oriented storytelling that should remain age-appropriate and safe.

Code

Programming help constrained by vulnerability and misuse requirements.

Health

Personal health education requiring evidence-based and respectful guidance.

Travel

Travel planning aligned with practical preferences and safe recommendations.

Figure. Scenario-level specification alignment across diverse scenarios and customized specifications. The same model must adapt to different behavioral goals and safety boundaries depending on the application context.
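One way to picture a scenario with its own behavioral and safety rules is as a small record type. This is a sketch only: the `Specification` and `Scenario` classes are hypothetical and do not reflect the released dataset schema, and the two specification texts are invented placeholders rather than entries from SpecBench.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Specification:
    kind: str  # assumed to be either "safety" or "behavioral"
    text: str


@dataclass
class Scenario:
    name: str
    specs: list[Specification] = field(default_factory=list)

    def of_kind(self, kind: str) -> list[str]:
        """Return the texts of all specifications of the given kind."""
        return [s.text for s in self.specs if s.kind == kind]


# Placeholder wording, not taken from the benchmark.
code_scenario = Scenario(
    name="Code",
    specs=[
        Specification("safety", "Do not produce code intended for misuse."),
        Specification("behavioral", "Explain the fix alongside the code."),
    ],
)

print(code_scenario.of_kind("safety"))
```

Keeping the two kinds in one list with a `kind` tag lets the same prompt be paired with a different scenario object, which mirrors the idea that boundaries change with the application rather than with the prompt.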

Evaluation

Safety First, Then Behavioral Alignment

SpecBench evaluates each response against the scenario specifications. Safety requirements decide whether the response crosses a boundary; behavioral requirements measure whether the safe response still satisfies the scenario's intended helpful behavior.

Safety Score

Measures whether responses avoid violating scenario-specific safety specifications.

Behavioral Score

Measures how well responses satisfy relevant behavioral specifications when judged in context.

SAR

Specification Alignment Rate (SAR) combines safety and behavior, assigning a score of zero to unsafe responses.
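The description above fixes only one property of SAR: an unsafe response scores zero. The exact formula is not given here, so the following is a minimal sketch under stated assumptions: a binary safety verdict per response, a behavioral score in [0, 1], a simple mean over responses, and a hypothetical `alpha` floor credited to any safe response (the default of 0 reduces the score to the behavioral score gated by safety).

```python
from statistics import mean
from typing import Iterable, Tuple


def response_score(is_safe: bool, behavioral: float, alpha: float = 0.0) -> float:
    """Score one response: 0 if unsafe, else alpha + (1 - alpha) * behavioral."""
    if not 0.0 <= behavioral <= 1.0:
        raise ValueError("behavioral score must lie in [0, 1]")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not is_safe:
        return 0.0  # unsafe helpfulness earns nothing
    return alpha + (1.0 - alpha) * behavioral


def sar(judgments: Iterable[Tuple[bool, float]], alpha: float = 0.0) -> float:
    """Average the per-response scores (assumed aggregation)."""
    return mean(response_score(safe, beh, alpha) for safe, beh in judgments)


# Three responses: safe and fully aligned, unsafe but "helpful", safe and half aligned.
print(sar([(True, 1.0), (False, 1.0), (True, 0.5)]))  # → 0.5
```

The second response shows why the gate matters: its behavioral score is perfect, yet it contributes nothing because it crossed a safety boundary.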

Results

Alignment Gaps and Test-Time Gains

The paper evaluates 18 instruct models and 15 reasoning models. SpecBench reveals substantial remaining alignment gaps: most models score below 65% SAR. Align3 improves Qwen3-14B from 51.03% to 62.92% SAR with minimal token overhead.

33 models. The benchmark covers both open-source and closed-source model families.

Safety-helpfulness trade-off. SAR makes unsafe helpfulness visible as a failure case.

Align3. Hierarchical reflection and revision improves Qwen3-14B without model retraining.
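For scale, the reported Qwen3-14B change from 51.03% to 62.92% SAR can be restated as an absolute and a relative gain. The snippet uses only those two figures; the variable names are illustrative.

```python
baseline_sar = 51.03  # Qwen3-14B SAR without Align3, in percent
align3_sar = 62.92    # Qwen3-14B SAR with Align3, in percent

absolute_gain = align3_sar - baseline_sar   # percentage points
relative_gain = absolute_gain / baseline_sar

print(f"{absolute_gain:.2f} points, {relative_gain:.1%} relative")
# → 11.89 points, 23.3% relative
```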

Figure. Model comparison. Overall evaluation results from GPT-4.1 and Qwen3-32B-thinking, reporting safety, behavioral, and SAR scores across 33 evaluated models.

Figure. Test-time deliberation. Changes in safety, behavioral, SAR, and token usage. Test-time deliberation (TTD) methods improve specification alignment, with Align3 providing a strong trade-off between efficiency and alignment.

Figure. GPT-4.1 evaluation. Heatmap of safety, behavioral, and SAR scores averaged over five scenarios.

Resources

Code, Data, and Quickstart

The public repository includes the generation and evaluation pipeline, scenario data, configuration files, and examples for running external APIs or vLLM-hosted models.

GitHub. Code, scripts, configuration, and evaluation pipeline.

Hugging Face Dataset. SpecBench dataset release.

arXiv. Paper abstract, metadata, and PDF link.

Quickstart. Minimal evaluation example using the SpecBench Python package.

Citation

Cite SpecBench

@misc{zhang2025reasoningboundariesenhancingspecification,
      title={Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation},
      author={Haoran Zhang and Yafu Li and Xuyang Hu and Dongrui Liu and Zhilin Wang and Bo Li and Yu Cheng},
      year={2025},
      eprint={2509.14760},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.14760},
}