AI Pulse
2026-06-06 · Safety & Interpretability

Cracks in the Foundation: What New Research Reveals About Model Control and Representation

Four recent papers collectively challenge assumptions that underpin model control: steering is less stable than linear theory predicts, linear probes reveal task format rather than reasoning type, backdoor defenses can generalize beyond known triggers, and preference-based training extends well past chat alignment.

The past few weeks have brought a cluster of results that should unsettle anyone who assumed the interpretability and control toolbox was more or less settled. Each paper attacks a different assumption, but together they sketch a picture of representations that are messier, richer, and harder to manipulate cleanly than mainstream accounts suggest.

Steering Is Unstable Because Representations Are Cylindrical, Not Linear

The dominant theoretical frame for activation steering has been the Linear Representation Hypothesis (LRH): concepts occupy orthogonal linear subspaces, so adding a direction vector should shift one behavior without touching the others. The paper (The Cylindrical Representation Hypothesis for Language Model Steering) argues that this idealization breaks down in practice and proposes that concept representations sit on cylindrical manifolds instead. When the LRH orthogonality assumption fails, steering interventions leak into adjacent concept dimensions, which the paper offers as the explanation for the well-documented instability in which a small steering vector triggers cascading unintended effects. The cylindrical hypothesis does more than describe the failure mode: it gives a geometric account of why lossless steering is impossible under the representations real models learn. Practical implication: steering techniques designed under LRH need rethinking, and evaluation frameworks should measure cross-concept leakage explicitly.
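To make the leakage idea concrete, here is a minimal NumPy sketch. It does not model the paper's cylindrical geometry. It only illustrates the simpler point that once two concept directions are not orthogonal, additive steering along one also moves the projection onto the other. The vectors, the `steer` and `leakage` helpers, and the strength `alpha` are hypothetical and chosen purely for illustration.

```python
import numpy as np

def steer(h, direction, alpha):
    """Add alpha times the unit-normalized concept direction to hidden state h."""
    unit = direction / np.linalg.norm(direction)
    return h + alpha * unit

def leakage(h, direction, other, alpha):
    """Shift in the projection onto `other` caused by steering along `direction`."""
    other_unit = other / np.linalg.norm(other)
    return float((steer(h, direction, alpha) - h) @ other_unit)

# Toy 3-d hidden state and concept directions (all values are made up).
h = np.array([0.5, -1.0, 2.0])
concept_a = np.array([1.0, 0.0, 0.0])
orthogonal_b = np.array([0.0, 1.0, 0.0])   # the LRH idealization
overlapping_b = np.array([0.6, 0.8, 0.0])  # cosine similarity 0.6 with concept_a

leak_orth = leakage(h, concept_a, orthogonal_b, alpha=4.0)
leak_overlap = leakage(h, concept_a, overlapping_b, alpha=4.0)
print(leak_orth, leak_overlap)
```

With orthogonal directions the second concept is untouched. With overlapping directions it shifts by `alpha` times the cosine between the two, so an evaluation that tracks only the targeted concept would miss the side effect.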

Linear Probes Read Task Format, Not Reasoning Type

A separate paper throws cold water on a popular interpretability claim: that probing hidden states can identify whether a model is engaging in deductive, inductive, or abductive reasoning. (Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States) probed Qwen3-14B on LogiQA 2.0, ARC-Challenge, and alphaNLI, benchmarks typically used to represent the classical reasoning trichotomy. Linear probes at layer 32 achieved 100% cross-validated accuracy with well-separated geometry. Impressive, until you examine what they actually learned: task format and surface features, not distinct reasoning representations. The geometry that looks like separate reasoning modes turns out to be latent task identity. This matters because much interpretability work draws conclusions about model cognition from exactly this kind of probe performance. If the signal is format, not reasoning, those conclusions require revisiting.
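The confound can be reproduced on synthetic data. The sketch below is not the paper's setup: it uses made-up Gaussian "hidden states", a least-squares linear probe, and a single train/test split in place of cross-validation. All names (`make_states`, `fit_probe`, `probe_accuracy`) and sizes are assumptions. The states contain no reasoning signal at all, only a per-format offset, yet the probe looks near-perfect whenever each label arrives in its own format.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_states(n, dim, format_ids, format_dirs, noise=1.0):
    """Synthetic hidden states: Gaussian noise plus a per-format offset."""
    return rng.normal(0.0, noise, size=(n, dim)) + format_dirs[format_ids]

def fit_probe(X, y, n_classes):
    """Least-squares linear probe onto one-hot labels, with a bias column."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(Xb, np.eye(n_classes)[y], rcond=None)
    return W

def probe_accuracy(W, X, y):
    """Fraction of examples whose highest-scoring class matches the label."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))

dim, n = 32, 600
format_dirs = rng.normal(0.0, 3.0, size=(3, dim))  # one offset per benchmark format
labels = rng.integers(0, 3, size=n)                 # stand-in "reasoning type" labels

# Confounded: each reasoning label is drawn from its own benchmark format.
X_conf = make_states(n, dim, labels, format_dirs)
# Control: same labels, but every example shares a single format.
X_ctrl = make_states(n, dim, np.zeros(n, dtype=int), format_dirs)

half = n // 2
acc_conf = probe_accuracy(fit_probe(X_conf[:half], labels[:half], 3),
                          X_conf[half:], labels[half:])
acc_ctrl = probe_accuracy(fit_probe(X_ctrl[:half], labels[:half], 3),
                          X_ctrl[half:], labels[half:])
print(acc_conf, acc_ctrl)
```

The control condition is the design point: high probe accuracy only supports a claim about reasoning if it survives when format is held constant across labels.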

Backdoor Unlearning Generalizes Across Unknown Triggers

On the defense side, the paper (Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs) addresses a structural asymmetry in model security: existing defenses require knowing the backdoor trigger, leaving defenders exposed when a model may contain triggers they have not identified. The paper demonstrates that training a model to ignore a single known trigger generalizes: the unlearning transfers to unknown backdoors the defense was never trained against. This is a meaningful practical advance. It shifts the problem from enumerating every possible trigger to demonstrating unlearning on a representative sample. The mechanism is not fully characterized yet, but the empirical result is robust enough to be operationally useful for organizations conducting model audits.

DPO Extends Beyond Conversational Alignment

Finally, the survey (Direct Preference Optimization Beyond Chatbots) covers applications of DPO outside the chat domain where it was popularized. The central finding is that DPO's preference-based optimization is domain-agnostic: it has been applied effectively to code generation, structured prediction, and retrieval tasks. This matters for safety because it means the same training machinery used to align conversational models can be applied to any agent task with a preference signal. The implication cuts both ways. DPO can reinforce desired behaviors broadly, but misaligned preference signals will generalize just as readily.
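The domain-agnostic point is visible in the standard DPO objective itself, which needs only the log-probabilities of a preferred and a dispreferred output under the policy and a frozen reference model. A minimal per-pair sketch in plain Python follows. The function name `dpo_loss`, the example log-probabilities, and the `beta` value are illustrative assumptions, not taken from the survey.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Hypothetical sequence log-probs for one preference pair, e.g. two code completions.
aligned = dpo_loss(policy_chosen_logp=-10.0, policy_rejected_logp=-14.0,
                   ref_chosen_logp=-12.0, ref_rejected_logp=-12.0)
# Same numbers with the preference label flipped.
flipped = dpo_loss(policy_chosen_logp=-14.0, policy_rejected_logp=-10.0,
                   ref_chosen_logp=-12.0, ref_rejected_logp=-12.0)
print(aligned, flipped)
```

Nothing in the function refers to chat. Whatever the labels mark as "chosen" is what gets reinforced, which is why a mislabeled or misaligned preference signal is optimized just as faithfully as a good one.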

Synthesis

These four results converge on a common theme: the tools we use to understand and control models are more fragile than advertised. Steering vectors leak. Probes confound format with cognition. Backdoor defenses can now generalize, but the unlearning mechanism behind that generalization is not yet characterized. Preference training is powerful and domain-general, amplifying both benefits and risks. The field is better equipped for control and defense than it was six months ago, but the theoretical foundations need to catch up with empirical practice.
