Latest AI News

Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

arXiv:2605.20241v1. Announce Type: cross.

Abstract: Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular, it remains unclear how safety evidence is formed across layers, which aspects of that layer-wise geometry support low-false-positive decisions, and which geometric biases remain stable under benchmark shift. We study this as an empirical decomposition problem and introduce Geometry-Lite, a compact prompt-level probe that maps each layer's final prompt-token representation to signed margins under centroid, local-neighborhood, and supervised linear-boundary readouts, then summarizes the resulting margin profiles by boundary position, layer-to-layer change, and coarse shape. Across nine instruction-tuned backbones (1.2B to 70B) and seven safety benchmarks, Geometry-Lite improves over single-layer probes while remaining close to raw multi-layer score stacking, making it a useful instrument for analyzing the multi-layer safety signal. The decomposition shows that safety evidence is expressed primarily through persistent boundary-position geometry: final or extremal margins and unsafe-side layer occupancy dominate aggregate detection performance. In contrast, finite-difference drift and structural summaries add little to pooled AUROC, although drift can provide small recall-oriented corrections under shifted low-FPR thresholds. Under benchmark shift, optimized linear boundaries are sharp on the training mixture, whereas class-conditional mean geometry retains separation more reliably on a predefined hard held-out subset. Overall, prompt-level safety evidence is not primarily a layer-to-layer motion signal, but a persistent layer-wise margin geometry whose useful components and readout-level biases become visible in decision-critical regimes.
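The margin-profile idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It covers only a centroid readout (the local-neighborhood and linear-boundary readouts are omitted), and everything in it is an assumption made for illustration: the function names `centroid_margins` and `summarize_profile`, the sign convention (positive means closer to the unsafe centroid), the choice of Euclidean distance, and the particular summary features.

```python
import numpy as np

def centroid_margins(h, mu_safe, mu_unsafe):
    """Signed centroid margin for each layer of one prompt.

    h, mu_safe, mu_unsafe: arrays of shape (n_layers, d) holding the
    final prompt-token representation and the per-layer class centroids.
    Assumed convention: a positive margin means the representation is
    closer to the unsafe centroid than to the safe one.
    """
    d_safe = np.linalg.norm(h - mu_safe, axis=-1)
    d_unsafe = np.linalg.norm(h - mu_unsafe, axis=-1)
    return d_safe - d_unsafe

def summarize_profile(m):
    """Summarize a per-layer margin profile (hypothetical feature set)."""
    drift = np.diff(m)  # finite-difference change between adjacent layers
    return {
        "final_margin": float(m[-1]),              # boundary position at the last layer
        "max_margin": float(m.max()),              # extremal margins
        "min_margin": float(m.min()),
        "unsafe_occupancy": float((m > 0).mean()), # fraction of layers on the unsafe side
        "mean_abs_drift": float(np.abs(drift).mean()),
    }

# Toy data: 4 layers, 2-dimensional hidden states (made up for illustration).
mu_safe = np.zeros((4, 2))
mu_unsafe = np.tile([2.0, 0.0], (4, 1))
h = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 0.0]])

margins = centroid_margins(h, mu_safe, mu_unsafe)
summary = summarize_profile(margins)
print(margins)
print(summary)
```

In this toy profile the representation starts at the safe centroid and ends at the unsafe one, so the position features (final margin, occupancy) and the drift feature are both nonzero. The abstract's finding is that, on real benchmarks, the position-type features carry most of the pooled detection performance.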

Latest News

RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

Leveraging Vision-Language Models to Detect Attention in Educational Videos

ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data

Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine

LEAP: A closed-loop framework for perovskite precursor additive discovery

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

Instance Discrimination for Link Prediction

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

DIVE: Embedding Compression via Self-Limiting Gradient Updates

PACD-Net: Pseudo-Augmented Contrastive Distillation for Glycemic Control Estimation from SMBG

The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning

One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers

Equilibrium Propagation and Hamiltonian Inference in the Diffusive Fitzhugh-Nagumo Model

AgForce Enables Antigen-conditioned Generative Antibody Design