Context Engineering for Trustworthiness:
Rescorla–Wagner Steering
Under Mixed and Inappropriate Contexts

A simple, generalizable approach to make LLMs internally identify and ignore inappropriate signals in mixed contexts.

Rushi Wang, Jiateng Liu, Cheng Qian, Yifan Shen, Yanzhou Pan, Zhaozhuo Xu, Ahmed Abbasi, Heng Ji, Denghui Zhang
UIUC ♡ • Google ♣ • Notre Dame ♠ • Stevens ♢

Abstract

We introduce the Poisoned Context Testbed, a benchmark of realistic mixed contexts that blend genuine evidence with inappropriate content, and adapt the Rescorla–Wagner (RW) model from associative learning to explain how competing contextual signals steer LLMs. We find that even minimal inappropriate content can disproportionately degrade responses. To counter this, we propose RW-Steering, a two-stage fine-tuning approach that trains models to internally detect and discount harmful signals, reversing undesired behavior curves and improving response quality. RW-Steering generalizes across contamination types, positions, and ratios, outperforming alignment and filtering baselines with strong gains under limited supervision.

Highlights

  • Behavioral Insight — Small contamination can over-steer LLMs; RW conditioning explains the dynamics.
  • Testbed — Poisoned Context Testbed with realistic mixed contexts to stress-test behavior curves.
  • Method & Results — RW-Steering reverses behavior curves and outperforms filtering/alignment baselines (~+39.8% avg. gain).

  • +39.8% average response quality gain
  • Robust to minor contamination
  • Generalizes across types / positions / ratios

Methodology

Poisoned Context Testbed

A controlled benchmark pairing real queries with web/RAG-like mixed contexts that blend evidence with inappropriate content (e.g., fake news, hate speech, non-factual claims, privacy leaks). By varying contamination type, position, and ratio, we trace behavior curves and show that even minor contamination can steer answers—providing a rigorous stress test for robustness and steering methods.

Testbed Design • Mixed Contexts • Vulnerability Analysis
Figure 1: Poisoned Context Testbed
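
To make the setup concrete, here is a minimal sketch of how a single testbed instance could be assembled. The function name, contamination labels, and the way position and ratio are encoded are our illustrative assumptions, not the released testbed code.

```python
import random

# Assumed contamination categories, mirroring the examples in the testbed description.
CONTAMINATION_TYPES = ["fake_news", "hate_speech", "non_factual", "privacy_leak"]

def build_mixed_context(evidence_chunks, poison_chunks, ratio, position="front", seed=0):
    """Assemble one mixed context: `ratio` is the fraction of inappropriate chunks and
    `position` controls where they appear (front / back / random).
    Illustrative sketch only, not the released Poisoned Context Testbed code."""
    rng = random.Random(seed)
    n_total = len(evidence_chunks)
    n_poison = round(ratio * n_total)
    poison = rng.sample(poison_chunks, k=min(n_poison, len(poison_chunks)))
    clean = list(evidence_chunks[: n_total - len(poison)])

    if position == "front":
        mixed = poison + clean
    elif position == "back":
        mixed = clean + poison
    else:  # "random": interleave contamination at arbitrary positions
        mixed = clean + poison
        rng.shuffle(mixed)
    return "\n\n".join(mixed)

# Example: ~20% fake-news contamination placed at the start of the retrieved context.
context = build_mixed_context(
    evidence_chunks=["[passage 1]", "[passage 2]", "[passage 3]", "[passage 4]", "[passage 5]"],
    poison_chunks=["[fabricated news claim]"],
    ratio=0.2,
    position="front",
)
```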

RW Model vs. LLM Behavior Curves

Top: Behavior curves predicted by our adapted Rescorla–Wagner (RW) model alongside the actual responses of three LLMs when exposed to two types of contextual information. As the proportion of the first type (C1) increases, the RW model's predictions closely match the LLMs' real-world outputs. Bottom: Behavior curves when models are exposed to disproportionate inappropriate context. Performance drops sharply when inappropriate information appears early, validating the pattern predicted by our RW model.

Modeling • Predictive Accuracy • Early Signal Impact
Figure 2: RW model vs. LLM behavior curves
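
For reference, the classic Rescorla–Wagner rule updates each cue's associative strength by ΔV_i = α_i · β · (λ − ΣV), so more salient cues capture a larger share of the association. The sketch below simulates two competing context "cues" (appropriate evidence C1 vs. inappropriate content C2) under this classic rule; the salience values and the reading of associative strength as response steering are illustrative assumptions, and the paper's adapted model may differ in its exact form.

```python
def rw_update(V, salience, rate, lam):
    """One classic Rescorla-Wagner step: every cue moves toward the shared outcome
    lam in proportion to its salience and the remaining prediction error (lam - sum(V))."""
    error = lam - sum(V.values())
    return {cue: v + salience[cue] * rate * error for cue, v in V.items()}

# Two competing context "cues": appropriate evidence (C1) and inappropriate content (C2).
V = {"C1": 0.0, "C2": 0.0}
salience = {"C1": 0.3, "C2": 0.6}  # assumption: inappropriate content is the more salient cue
for _ in range(50):
    V = rw_update(V, salience, rate=0.1, lam=1.0)

# C2 ends with roughly twice C1's associative strength despite equal exposure, echoing how
# a small but salient amount of inappropriate context can over-steer the response.
print(V)
```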

RW‑Steering vs. Baselines

Our Approaches for Steering the Behavior of LLMs. Left: Three baseline approaches considered, each subject to different limitations. Right: Our RW-Steering approach. We first restructure the prompt to encourage the model to jointly optimize its judgment of inappropriate context and the generation of human-preferred answers, thereby internalizing the desired behavior. We then supplement training with examples containing a small number of inappropriate context segments to address cases where the model's internal judgment may fail.

Method Comparison • Joint Optimization • Superior Generalization
Figure 3: RW-Steering vs. baselines
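
To illustrate the first stage, the sketch below shows one plausible way a training example could be restructured so that a single fine-tuning target jointly supervises the judgment of inappropriate segments and the human-preferred answer. The prompt template and field names are our assumptions, not the paper's exact format.

```python
def make_rw_steering_example(question, context_segments, inappropriate_ids, preferred_answer):
    """Stage 1 of RW-Steering (sketch): restructure the prompt so one fine-tuning target
    covers both (a) judging which context segments are inappropriate and (b) producing
    the human-preferred answer. Template and field names are illustrative assumptions."""
    numbered = "\n".join(f"[{i}] {seg}" for i, seg in enumerate(context_segments))
    prompt = (
        "Context segments:\n" + numbered + "\n\n"
        f"Question: {question}\n"
        "First list the ids of any inappropriate segments, then answer using only the rest."
    )
    completion = (
        "Inappropriate segments: "
        + (", ".join(str(i) for i in inappropriate_ids) if inappropriate_ids else "none")
        + f"\nAnswer: {preferred_answer}"
    )
    return {"prompt": prompt, "completion": completion}
```

The second stage then supplements training with examples whose contexts contain only a small number of inappropriate segments, covering the regime where the model's internal judgment is most likely to fail.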

Behavior Curve Reversal

Behavior curves under disproportionate mixed context (Qwen2 & Phi-2). Each plot shows response quality (0–100) vs. the proportion of inappropriate content. Left (baseline): alignment fine-tuning degrades performance relative to the base model. Middle (baseline): context filtering yields general improvement but remains unstable. Right (ours): RW-Steering reverses the undesired behavior curve and delivers more robust, generalizable performance across mixture ratios.

Curve Reversal • Stable Performance • Extreme Robustness
Figure 4: Behavior curve reversal
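
Curves like these can be traced by sweeping the contamination ratio and scoring the model's answers at each point. The sketch below outlines that loop; it reuses the hypothetical build_mixed_context helper from the testbed sketch above, and the generate and score_quality callables are placeholders for the LLM call and the 0-100 quality judge.

```python
from typing import Callable, Dict, List, Sequence

def behavior_curve(
    generate: Callable[[str, str], str],         # (question, context) -> model answer
    score_quality: Callable[[str, str], float],  # (answer, reference) -> quality score, 0-100
    dataset: Sequence[dict],
    ratios: Sequence[float],
    position: str = "front",
) -> Dict[float, float]:
    """Trace a behavior curve: mean response quality at each contamination ratio.
    Reuses the (hypothetical) build_mixed_context helper from the testbed sketch above."""
    curve: Dict[float, float] = {}
    for r in ratios:
        scores: List[float] = []
        for ex in dataset:
            ctx = build_mixed_context(ex["evidence"], ex["poison"], ratio=r, position=position)
            answer = generate(ex["question"], ctx)
            scores.append(score_quality(answer, ex["reference"]))
        curve[r] = sum(scores) / len(scores)
    return curve

# Example sweep matching Table 2's grid (0%, 5%, ..., 95%):
# ratios = [i / 20 for i in range(20)]
```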

Results

Table 1 — Results on Models Exposed to Proportionate Fake News Context

Consistency and Cleanliness evaluation metrics across different models and methods

Model | Eval | With context | No context | Self-Aligned | Human-Aligned | Self-Enhanced | Human-Aligned (Awareness) | Context Filtering | RW-Steering
Phi-2 | Consistency | 66.3 | 48.5 | 82.3 | 80.7 | 77.9 | 79.8 | 75.6 | 76.2
Phi-2 | Cleanliness | 53.0 | 75.5 | 79.4 | 80.6 | 82.6 | 81.4 | 58.5 | 83.9
Qwen2-1.5B | Consistency | 62.7 | 46.1 | 70.8 | 74.4 | 68.4 | 68.4 | 66.3 | 72.9
Qwen2-1.5B | Cleanliness | 51.2 | 83.3 | 83.5 | 77.8 | 82.1 | 78.6 | 53.1 | 82.0
gemma-2-2b | Consistency | 67.4 | 52.5 | 73.5 | 74.3 | 69.0 | 72.4 | 69.1 | 73.9
gemma-2-2b | Cleanliness | 55.3 | 88.8 | 88.2 | 75.5 | 86.4 | 78.3 | 58.1 | 87.5
Llama-3.2-1B | Consistency | 68.1 | 44.6 | 72.3 | 75.0 | 70.1 | 73.2 | 68.1 | 74.1
Llama-3.2-1B | Cleanliness | 64.8 | 84.9 | 85.3 | 76.9 | 85.9 | 77.9 | 72.2 | 85.4

Table 2 — Results on Qwen2 Model Exposed to Disproportionate Inappropriate Context

Response Quality evaluation across contamination levels (0-45%)

Method 0% 5% 10% 15% 20% 25% 30% 35% 40% 45%
Baseline (With Context) 74.1 61.9 59.2 54.6 57.1 59.3 54.8 57.4 56.4 56.7
Context Filtering 74.9 72.7 72.1 64.7 59.9 58.4 59.2 60.7 59.1 59.5
RW‑Steering 77.6 76.4 75.1 75.8 76.1 75.8 76.2 75.3 77.2 76.9

Table 2 (continued) — Results on Qwen2 Model Exposed to Disproportionate Inappropriate Context

Response Quality evaluation across contamination levels (50-95%)

Method 50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
Baseline (With Context) 55.2 53.8 53.7 54.8 53.9 54.3 51.5 50.8 49.0 47.8
Context Filtering 61.2 61.4 59.2 62.5 58.6 61.1 57.3 55.7 54.5 55.5
RW‑Steering 75.5 76.9 76.8 76.1 78.2 76.1 76.5 74.5 74.1 76.2

Findings

01

Mixed-Context Vulnerability (Characterized & Addressed)

Using an RW (Rescorla–Wagner) model, we show that even minimal inappropriate content can disproportionately degrade LLM responses, and we introduce RW-Steering, a two-stage fine-tuning method that enables models to detect and discount harmful context, outperforming alignment and filtering baselines.

RW Model • RW-Steering • Vulnerability

02

Toward Robust & Safe LLMs

RW-Steering generalizes across diverse contamination types, improves response quality by ~39.8% on average, and is extendable to agentic settings, offering a practical path to safer, more reliable LLMs.

Generalization • +39.8% Quality • Agentic Settings

Case Studies

Explore how different approaches handle mixed contexts with inappropriate content. Compare model responses across contamination levels.


Citation

If you find RW-Steering useful in your research, please consider citing our work:

@article{wang2025context,
  title={Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts},
  author={Wang, Rushi and Liu, Jiateng and Qian, Cheng and Shen, Yifan and Pan, Yanzhou and Xu, Zhaozhuo and Abbasi, Ahmed and Ji, Heng and Zhang, Denghui},
  journal={arXiv preprint arXiv:2509.04500},
  year={2025}
}