Steerable Adversarial Scenario Generation through Test-Time Preference Alignment
Abstract
Adversarial scenario generation is a cost-effective approach for the safety assessment of autonomous driving systems. However, existing methods are either overly aggressive, resulting in low realism, or constrained to a fixed trade-off between multiple objectives, lacking the flexibility to adapt to diverse training and testing needs without costly retraining. Here, we reframe the task as a multi-objective preference alignment problem and introduce the first framework for Steerable Adversarial scenario GEneration (SAGE). SAGE enables fine-grained test-time control over the trade-off between adversariality and realism without any retraining. We first propose hierarchical group-based preference optimization, a data-efficient offline alignment method that learns to balance competing objectives by decoupling hard feasibility constraints from soft preferences. Instead of training a fixed model, SAGE fine-tunes two experts on opposing preferences and constructs a continuous spectrum of policies at inference time by linearly interpolating their weights. We provide theoretical justification for this scheme through the lens of linear mode connectivity. Extensive experiments demonstrate that SAGE not only generates scenarios with a superior balance of adversariality and realism but also enables more effective closed-loop training of driving policies.
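In symbols (our notation; the paper's exact formulation may differ), the steerable policy is obtained by parameter-wise interpolation of the two fine-tuned experts:

θ(w_adv) = (1 − w_adv) · θ_realistic + w_adv · θ_adversarial,  with w_adv ∈ [0, 1],

so w_adv = 0 recovers the realism-preferring expert, w_adv = 1 the adversarial expert, and intermediate values trace a continuous spectrum between them.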
Method Overview

Left: Existing methods suffer from a fixed trade-off between adversariality and realism. Center: Our method, SAGE, reframes the problem as preference alignment. We train two expert models (adversarial and realistic) and interpolate their weights at test time to steer generation across a continuous spectrum of behaviors. Right: This steerability is highly effective for dual curriculum learning in closed-loop training.
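To make the weight-interpolation step concrete, here is a minimal sketch of how two expert checkpoints could be blended at test time. It assumes both experts share one architecture; the function name `interpolate_experts`, the checkpoint file names, and the `generator` object are illustrative placeholders, not the released implementation.

```python
import torch

def interpolate_experts(realistic_sd, adversarial_sd, w_adv):
    """Blend two expert checkpoints:
    theta = (1 - w_adv) * theta_realistic + w_adv * theta_adversarial.

    Both state dicts must come from the same architecture so keys and shapes match;
    w_adv = 0.0 recovers the realistic expert and w_adv = 1.0 the adversarial one.
    """
    assert realistic_sd.keys() == adversarial_sd.keys(), "experts must share an architecture"
    blended = {}
    for name, p_real in realistic_sd.items():
        p_adv = adversarial_sd[name]
        if torch.is_floating_point(p_real):
            blended[name] = (1.0 - w_adv) * p_real + w_adv * p_adv
        else:
            # Non-float buffers (e.g., integer step counters) are copied, not interpolated.
            blended[name] = p_real.clone()
    return blended

# Illustrative usage (checkpoint paths and `generator` are placeholders):
# realistic_sd = torch.load("expert_realistic.pt")
# adversarial_sd = torch.load("expert_adversarial.pt")
# for w_adv in (0.0, 0.5, 1.0):  # the settings reported in the tables below
#     generator.load_state_dict(interpolate_experts(realistic_sd, adversarial_sd, w_adv))
```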
Results Gallery

Test-Time Steerability: Generated trajectories smoothly transition from compliant to aggressive as the adversarial weight increases from 0.0 to 1.0.

High-Quality Scenarios: SAGE generates challenging yet physically plausible maneuvers, crucial for meaningful safety validation, while baselines often produce awkward or rule-violating trajectories.

Superior Trade-off: Our weight-mixing strategy (blue line) traces a superior Pareto front, achieving better realism for any given level of adversariality compared to other mixing strategies.

Enhanced Driving Policy: The agent trained with SAGE shows a clear advantage in key metrics such as reward and completion rate compared to baseline training methods.
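Because the adversarial weight can be changed freely at inference time, a closed-loop curriculum reduces to scheduling w_adv over the course of training. The sketch below uses a linear ramp purely as an illustrative assumption; the paper's dual-curriculum schedule and the environment/agent objects in the comments are placeholders.

```python
def adversarial_weight_schedule(step, total_steps, w_min=0.0, w_max=1.0):
    """Illustrative curriculum: ramp the adversarial weight from mild to aggressive.

    A linear ramp is assumed purely for exposition; any monotone or adaptive
    schedule over w_adv in [0, 1] fits the same interface.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    return w_min + frac * (w_max - w_min)

# Sketch of one closed-loop training iteration (all objects below are placeholders):
# for step in range(total_steps):
#     w_adv = adversarial_weight_schedule(step, total_steps)
#     generator.load_state_dict(interpolate_experts(realistic_sd, adversarial_sd, w_adv))
#     scenario = generator.sample(scene_context)       # opponent behavior at the current difficulty
#     rollout = collect_episode(ego_policy, scenario)  # ego agent drives against it
#     ego_policy.update(rollout)                       # standard RL update on the ego agent
```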
Testing Results: Replay Policy
Evaluation of adversarial generation methods against the Replay policy. Higher is better for Attack Success Rate and Adversarial Reward (↑), while lower is better for penalty metrics (↓).
Methods | Attack Succ. Rate ↑ | Adv. Reward ↑ | Real. Pen. (Behav.) ↓ | Real. Pen. (Kine.) ↓ | Map Comp. (Crash Obj.) ↓ | Map Comp. (Cross Line) ↓ | Dist. Diff. WD (Accel.) | Dist. Diff. WD (Vel.) | Dist. Diff. WD (Yaw)
---|---|---|---|---|---|---|---|---|---
Rule | 100.00% | 5.048 | 2.798 | 5.614 | 1.734 | 7.724 | 2.080 | 8.546 | 0.204
CAT | 94.85% | 3.961 | 8.941 | 3.143 | 2.466 | 9.078 | 1.556 | 7.233 | 0.225
KING | 40.85% | 2.243 | 5.883 | 3.434 | 3.126 | 6.056 | 0.972 | 255.5 | 0.098
AdvTrajOpt | 70.46% | 2.652 | 4.500 | 2.775 | 2.547 | 10.16 | 1.754 | 6.177 | 0.268
SEAL | 63.93% | 1.269 | 3.017 | 2.423 | 2.732 | 11.612 | 1.544 | 6.959 | 0.202
SAGE (w_adv = 0.0) | 16.26% | 1.065 | 0.332 | 1.998 | 0.677 | 0.948 | 1.459 | 9.313 | 0.054
SAGE (w_adv = 0.5) | 50.41% | 2.523 | 0.483 | 2.064 | 0.755 | 0.949 | 1.521 | 8.471 | 0.079
SAGE (w_adv = 1.0) | 76.15% | 4.121 | 1.429 | 2.479 | 0.731 | 1.084 | 2.098 | 8.088 | 0.184
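The Dist. Diff. (WD) columns compare kinematic statistics of generated and logged trajectories. Assuming they are one-dimensional Wasserstein distances over acceleration, velocity, and yaw samples (our reading of the abbreviation, and a simplification of the actual evaluation protocol), they could be computed roughly as follows.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def kinematic_wd(generated, logged):
    """1-D Wasserstein distance between generated and logged kinematic features.

    `generated` and `logged` map a feature name ("accel", "vel", "yaw") to flat
    arrays of per-step values pooled over all evaluated scenarios.
    """
    return {
        key: wasserstein_distance(np.asarray(generated[key]).ravel(),
                                  np.asarray(logged[key]).ravel())
        for key in ("accel", "vel", "yaw")
    }

# Toy example with synthetic samples (a real evaluation would pool values from the scenario sets):
# rng = np.random.default_rng(0)
# gen = {k: rng.normal(0.0, 1.0, 10_000) for k in ("accel", "vel", "yaw")}
# log = {k: rng.normal(0.2, 1.0, 10_000) for k in ("accel", "vel", "yaw")}
# print(kinematic_wd(gen, log))
```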
Testing Results: RL Policy
Evaluation of adversarial generation methods against the RL policy.
Methods | Attack Succ. Rate ↑ | Adv. Reward ↑ | Real. Pen. (Behav.) ↓ | Real. Pen. (Kine.) ↓ | Map Comp. (Crash Obj.) ↓ | Map Comp. (Cross Line) ↓ | Dist. Diff. WD (Accel.) | Dist. Diff. WD (Vel.) | Dist. Diff. WD (Yaw)
---|---|---|---|---|---|---|---|---|---
Rule | 65.57% | 2.761 | 2.180 | 113.7 | 1.803 | 6.148 | 10.85 | 13.47 | 0.336
CAT | 30.33% | 1.319 | 8.191 | 3.039 | 2.623 | 6.967 | 1.539 | 8.877 | 0.187
KING | 19.14% | 1.148 | 2.041 | 2.596 | 3.114 | 5.857 | 0.983 | 259.1 | 0.097
AdvTrajOpt | 19.40% | 0.992 | 4.542 | 2.779 | 2.459 | 9.973 | 1.749 | 6.187 | 0.269
SEAL | 31.40% | 0.752 | 5.871 | 2.684 | 3.030 | 11.98 | 1.563 | 8.267 | 0.267
SAGE (w_adv = 0.0) | 11.20% | 0.722 | 0.332 | 2.000 | 0.738 | 0.956 | 1.456 | 9.344 | 0.055
SAGE (w_adv = 0.5) | 13.66% | 0.819 | 0.496 | 2.066 | 0.820 | 0.820 | 1.515 | 8.475 | 0.080
SAGE (w_adv = 1.0) | 28.42% | 1.400 | 1.468 | 2.496 | 0.792 | 1.366 | 2.098 | 8.114 | 0.188
Closed-loop RL Training
Evaluation of the trained RL policies in the log-replay (normal, WOMD) environments.
Methods | Reward ↑ | Cost ↓ | Compl. ↑ | Coll. ↓ | Ave. Speed ↑ | Ave. Jerk ↓ |
---|---|---|---|---|---|---|
SAGE | 51.99 ± 1.22 | 0.48 ± 0.05 | 0.77 ± 0.02 | 0.16 ± 0.05 | 9.27 ± 0.03 | 24.97 ± 0.53 |
CAT | 46.81 ± 4.33 | 0.50 ± 0.05 | 0.67 ± 0.02 | 0.18 ± 0.05 | 7.21 ± 0.05 | 28.15 ± 1.06 |
Replay (No Adv) | 50.16 ± 5.32 | 0.50 ± 0.07 | 0.72 ± 0.04 | 0.23 ± 0.02 | 9.03 ± 0.03 | 27.53 ± 0.98 |
Rule-based Adv | 44.61 ± 3.88 | 0.52 ± 0.05 | 0.63 ± 0.04 | 0.13 ± 0.00 | 6.00 ± 0.10 | 28.22 ± 1.44 |
Qualitative Comparisons
Panel labels (shown for two example scenarios): RAW, CAT, Rule, SAGE.
BibTeX
@article{nie2025steerable,
  title={Steerable Adversarial Scenario Generation through Test-Time Preference Alignment},
  author={Nie, Tong and Mei, Yuewen and Tang, Yihong and He, Junlin and Sun, Jie and Shi, Haotian and Ma, Wei and Sun, Jian},
  journal={arXiv preprint arXiv:2509.20102},
  year={2025}
}