Context Engineering for Trustworthiness:
Rescorla–Wagner Steering
Under Mixed and Inappropriate Contexts
A simple, generalizable approach to make LLMs internally identify and ignore inappropriate signals in mixed contexts.
Abstract
We introduce the Poisoned Context Testbed, which pairs real queries with realistic mixed contexts, and adapt a Rescorla–Wagner (RW) model from associative learning to explain how competing contextual signals steer LLMs. Even minimal inappropriate content can disproportionately degrade responses. To counter this, we propose RW-Steering, a two-stage fine-tuning approach that trains models to detect and discount harmful signals, reversing undesired behavior curves and improving response quality. RW-Steering generalizes across contamination types, positions, and ratios, and outperforms alignment and filtering baselines, with strong gains even under limited supervision.
Highlights
- Behavioral Insight — Small contamination can over-steer LLMs; RW conditioning explains the dynamics.
- Testbed — Poisoned Context Testbed with realistic mixed contexts to stress-test behavior curves.
- Method & Results — RW-Steering reverses behavior curves and outperforms filtering/alignment baselines (~+39.8% avg. gain).
Methodology
Poisoned Context Testbed
A controlled benchmark pairing real queries with web/RAG-like mixed contexts that blend evidence with inappropriate content (e.g., fake news, hate speech, non-factual claims, privacy leaks). By varying contamination type, position, and ratio, we trace behavior curves and show that even minor contamination can steer answers—providing a rigorous stress test for robustness and steering methods.
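To make the contamination knobs concrete, here is a minimal sketch of how such a mixed context can be assembled. It is illustrative only, not the released testbed code; `build_mixed_context` and its arguments are hypothetical names.

```python
# Illustrative sketch (not the released testbed code): interleave clean evidence
# chunks with inappropriate segments at a chosen contamination ratio and position.
import random
from typing import List

def build_mixed_context(clean_chunks: List[str],
                        poisoned_chunks: List[str],
                        ratio: float,
                        position: str = "front",
                        seed: int = 0) -> str:
    """Return a context string whose fraction of poisoned chunks is roughly `ratio`.

    position: "front"  -> poisoned chunks precede the evidence
              "back"   -> poisoned chunks follow the evidence
              "random" -> poisoned chunks are shuffled into the evidence
    """
    rng = random.Random(seed)
    # Number of poisoned chunks needed so that poisoned / (poisoned + clean) ~= ratio.
    n_poison = min(len(poisoned_chunks),
                   round(ratio * len(clean_chunks) / max(1e-9, 1.0 - ratio)))
    poison = rng.sample(poisoned_chunks, n_poison)

    if position == "front":
        ordered = poison + clean_chunks
    elif position == "back":
        ordered = clean_chunks + poison
    else:  # "random"
        ordered = clean_chunks + poison
        rng.shuffle(ordered)
    return "\n\n".join(ordered)

# Example: ~25% fake-news contamination placed before the evidence.
context = build_mixed_context(
    clean_chunks=["Evidence passage 1 ...", "Evidence passage 2 ...", "Evidence passage 3 ..."],
    poisoned_chunks=["Fabricated claim ...", "Another fabricated claim ..."],
    ratio=0.25,
    position="front",
)
```

Sweeping `ratio` and `position` over contexts like this, and scoring the resulting answers, is what produces the behavior curves discussed below.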

RW Model vs. LLM Behavior Curves
Top: Behavior curves predicted by our adapted Rescorla–Wagner (RW) model and the actual responses of three LLMs when exposed to two types of contextual information. As the proportion of the first type (C1) increases, the RW model's predictions closely match the LLMs' real-world outputs. Bottom: Behavior curves when models are exposed to disproportionate inappropriate context. Performance drops sharply when inappropriate information appears early, validating the pattern predicted by our RW model.
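For readers unfamiliar with Rescorla–Wagner conditioning, the sketch below applies the classical RW update, V <- V + alpha * beta * (lambda - V_total), to two competing context signals. The salience values and the single shared outcome are illustrative assumptions rather than the paper's exact adaptation; the point is that a more salient inappropriate signal captures a disproportionate share of associative strength even when it occupies little of the context.

```python
# Classical Rescorla-Wagner update with two competing cues: the "inappropriate"
# signal and the "appropriate" evidence. Salience values here are illustrative.

def rw_dominance(p, salience_bad=3.0, salience_good=1.0, beta=0.3, lam=1.0, steps=200):
    """Share of associative strength captured by the inappropriate signal
    when it makes up a fraction p of the context."""
    v_bad, v_good = 0.0, 0.0
    a_bad = salience_bad * p            # salience scaled by the signal's share of the context
    a_good = salience_good * (1.0 - p)
    for _ in range(steps):
        error = lam - (v_bad + v_good)  # shared prediction error for the compound context
        v_bad += a_bad * beta * error
        v_good += a_good * beta * error
    return v_bad / (v_bad + v_good)

print([round(rw_dominance(p), 2) for p in (0.05, 0.10, 0.25, 0.50)])
# -> [0.14, 0.25, 0.5, 0.75]: 10% contamination already accounts for ~25% of the
#    associative strength, i.e. small amounts of contamination over-steer the prediction.
```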

RW‑Steering vs. Baselines
Our Approaches for Steering the Behavior of LLMs. Left: Three baseline approaches considered, each subject to different limitations. Right: Our RW-Steering approach. We first restructure the prompt to encourage the model to jointly optimize its judgment of inappropriate context and the generation of human-preferred answers, thereby internalizing the desired behavior. We then supplement training with examples containing a small number of inappropriate context segments to address cases where the model's internal judgment may fail.
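As a rough illustration of the two stages, the sketch below builds supervised examples in that spirit. The prompt template, field names, and data mixture are hypothetical; see the paper for the exact formulation.

```python
# Hypothetical data-construction sketch in the spirit of RW-Steering's two stages.

JUDGE_THEN_ANSWER_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "First, identify any context segments that are inappropriate or unreliable and "
    "should be ignored. Then answer the question using only the remaining evidence."
)

def make_example(question, context_chunks, bad_indices, gold_answer):
    """Stage 1: couple the judgment of inappropriate segments with the
    human-preferred answer so that both are optimized jointly."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks))
    judgment = ("Segments to ignore: " + ", ".join(f"[{i}]" for i in sorted(bad_indices))
                if bad_indices else "Segments to ignore: none")
    return {
        "prompt": JUDGE_THEN_ANSWER_TEMPLATE.format(context=context, question=question),
        "response": f"{judgment}\n\nAnswer: {gold_answer}",
    }

# Stage 2: supplement the training set with lightly contaminated contexts
# (only one or two inappropriate segments), the regime where the model's
# internal judgment is most likely to fail.
example = make_example(
    question="Who shared the 2020 Nobel Prize in Physics?",
    context_chunks=["Penrose, Genzel and Ghez shared the 2020 physics prize.",
                    "BREAKING: the 2020 physics prize was secretly revoked."],
    bad_indices=[1],
    gold_answer="Roger Penrose, Reinhard Genzel and Andrea Ghez.",
)
```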

Behavior Curve Reversal
Behavior curves under disproportionate mixed context (Qwen2 & Phi-2). Each plot shows response quality (0–100) vs. the proportion of inappropriate content. Left (baseline): alignment fine-tuning degrades performance relative to the base model. Middle (baseline): context filtering yields general improvement but remains unstable. Right (ours): RW-Steering reverses the undesired behavior curve and delivers more robust, generalizable performance across mixture ratios.

Results
Table 1 — Results on Models Exposed to Proportionate Fake-News Context
Consistency and Cleanliness evaluation metrics across different models and methods
Model | Eval | With context | No context | Self‑Aligned | Human‑Aligned | Self‑Enhanced | Human‑Aligned (Awareness) | Context Filtering | RW‑Steering |
---|---|---|---|---|---|---|---|---|---|
Phi‑2 | Consistency | 66.3 | 48.5 | 82.3 | 80.7 | 77.9 | 79.8 | 75.6 | 76.2 |
Phi‑2 | Cleanliness | 53.0 | 75.5 | 79.4 | 80.6 | 82.6 | 81.4 | 58.5 | 83.9 |
Qwen2‑1.5B | Consistency | 62.7 | 46.1 | 70.8 | 74.4 | 68.4 | 68.4 | 66.3 | 72.9 |
Qwen2‑1.5B | Cleanliness | 51.2 | 83.3 | 83.5 | 77.8 | 82.1 | 78.6 | 53.1 | 82.0 |
gemma‑2‑2b | Consistency | 67.4 | 52.5 | 73.5 | 74.3 | 69.0 | 72.4 | 69.1 | 73.9 |
gemma‑2‑2b | Cleanliness | 55.3 | 88.8 | 88.2 | 75.5 | 86.4 | 78.3 | 58.1 | 87.5 |
Llama‑3.2‑1B | Consistency | 68.1 | 44.6 | 72.3 | 75.0 | 70.1 | 73.2 | 68.1 | 74.1 |
Llama‑3.2‑1B | Cleanliness | 64.8 | 84.9 | 85.3 | 76.9 | 85.9 | 77.9 | 72.2 | 85.4 |
Table 2 — Results on Qwen2 Model Exposed to Disproportionate Inappropriate Context
Response Quality evaluation across contamination levels (0-45%)
Method | 0% | 5% | 10% | 15% | 20% | 25% | 30% | 35% | 40% | 45% |
---|---|---|---|---|---|---|---|---|---|---|
Baseline (With Context) | 74.1 | 61.9 | 59.2 | 54.6 | 57.1 | 59.3 | 54.8 | 57.4 | 56.4 | 56.7 |
Context Filtering | 74.9 | 72.7 | 72.1 | 64.7 | 59.9 | 58.4 | 59.2 | 60.7 | 59.1 | 59.5 |
RW‑Steering | 77.6 | 76.4 | 75.1 | 75.8 | 76.1 | 75.8 | 76.2 | 75.3 | 77.2 | 76.9 |
Table 2 (continued) — Results on Qwen2 Model Exposed to Disproportionate Inappropriate Context
Response Quality evaluation across contamination levels (50-95%)
Method | 50% | 55% | 60% | 65% | 70% | 75% | 80% | 85% | 90% | 95% |
---|---|---|---|---|---|---|---|---|---|---|
Baseline (With Context) | 55.2 | 53.8 | 53.7 | 54.8 | 53.9 | 54.3 | 51.5 | 50.8 | 49.0 | 47.8 |
Context Filtering | 61.2 | 61.4 | 59.2 | 62.5 | 58.6 | 61.1 | 57.3 | 55.7 | 54.5 | 55.5 |
RW‑Steering | 75.5 | 76.9 | 76.8 | 76.1 | 78.2 | 76.1 | 76.5 | 74.5 | 74.1 | 76.2 |
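For a quick sanity check on Table 2, the snippet below averages each method's response quality across the twenty contamination levels (0% to 95% in 5% steps). These means cover only this Qwen2 disproportionate setting; the ~39.8% headline gain presumably aggregates additional settings reported in the paper.

```python
# Row means for Table 2 (Qwen2, disproportionate inappropriate context).
baseline = [74.1, 61.9, 59.2, 54.6, 57.1, 59.3, 54.8, 57.4, 56.4, 56.7,
            55.2, 53.8, 53.7, 54.8, 53.9, 54.3, 51.5, 50.8, 49.0, 47.8]
filtering = [74.9, 72.7, 72.1, 64.7, 59.9, 58.4, 59.2, 60.7, 59.1, 59.5,
             61.2, 61.4, 59.2, 62.5, 58.6, 61.1, 57.3, 55.7, 54.5, 55.5]
rw_steering = [77.6, 76.4, 75.1, 75.8, 76.1, 75.8, 76.2, 75.3, 77.2, 76.9,
               75.5, 76.9, 76.8, 76.1, 78.2, 76.1, 76.5, 74.5, 74.1, 76.2]

mean = lambda xs: sum(xs) / len(xs)
print(f"baseline (with context): {mean(baseline):.1f}")    # ~55.8
print(f"context filtering:       {mean(filtering):.1f}")   # ~61.4
print(f"rw-steering:             {mean(rw_steering):.1f}") # ~76.2 (~36% relative gain over baseline here)
```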
Findings
Mixed-Context Vulnerability (Characterized & Addressed)
Using the adapted Rescorla–Wagner (RW) model, we show that even minimal inappropriate content can disproportionately degrade LLM responses. We introduce RW-Steering, a two-stage fine-tuning method that enables models to detect and discount harmful context and that outperforms alignment and filtering baselines.
Toward Robust & Safe LLMs
RW-Steering generalizes across diverse contamination types, positions, and ratios, and improves response quality by ~39.8% on average, offering a practical path to safer, more reliable LLMs that can extend to agentic settings.
Case Studies
Explore how different approaches handle mixed contexts with inappropriate content. Compare model responses across contamination levels.
Citation
If you find RW-Steering useful in your research, please consider citing our work:
@article{wang2025context,
  title={Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts},
  author={Wang, Rushi and Liu, Jiateng and Qian, Cheng and Shen, Yifan and Pan, Yanzhou and Xu, Zhaozhuo and Abbasi, Ahmed and Ji, Heng and Zhang, Denghui},
  journal={arXiv preprint arXiv:2509.04500},
  year={2025}
}