How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing
arXiv:2603.13259v2 Announce Type: replace-cross Abstract: When a decoder-only transformer is forced to process matched correct and incorrect single-token continuations of a factual query, the two pathways through hidden-state space diverge in a specific way: displacement vectors from the query-only representation maintain approximately equal magnitude but rotate apart in direction. The angular separation grows through mid-depth, and late layers resolve the asymmetric outcome -a logit-lens preference that, in the incorrect run, falls far below the naive prior of equal probability, corresponding to the model assigning approximately 11.5 times more probability to the incorrect token than to the correct one. We characterize this two-phase pattern-rotational divergence in mid-depth followed by late-layer asymmetric commitment-as the empirical geometric signature of what looks externally like the model rejecting a wrong continuation, while remaining explicit that it is an observational characterization, not a causal account. The pattern is consistent across six decoder-only transformers including five architecture families from 1B to 13B parameters. A seventh model (Qwen2 1.5B) shows a flat profile under the present extraction protocol that is plausibly a tokenizer-fragmentation artefact rather than a real scale floor; the question of an emergence threshold is left open. Single-layer activation patching does not recover the correct token at any layer band, meaning the late-layer asymmetry is not localized to a discrete component under the protocol used. Taken together, the evidence is consistent with a distributed-by-trajectory account of factual constraint processing-geometric structure that emerges cumulatively across many layers rather than from a single localized circuit and inconsistent with the simplest single-layer localized-recall account.
