How Evidence-Carrying Agents Reinforce AI Safety

Multimodal agents that interpret screenshots, documents, and webpages face a critical challenge: ensuring that actions based on visual inputs are authorized and safe. The main risk is hallucination-to-action conversion, a failure mode in which a false perceptual claim made by the model ends up enabling an unauthorized action.

Evidence-Carrying Approach

Evidence-carrying multimodal agents (ECA) target this failure mode. Unlike agents that act on unverified model text, an ECA requires concrete evidence before an action proceeds. Each tool call is decomposed into action-critical predicates, and typed certificates for those predicates are obtained from verifiers such as DOM, OCR, and AX. Only actions whose predicates are verified are authorized. As a result, the architecture does not conceal perception errors; it surfaces them through named verifier outputs, schemas, and implementation residuals.
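The gating idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class names (`Predicate`, `Certificate`, `ToolCall`), the `authorize` function, and the fail-closed handling of calls with no predicates are all assumptions made for the example. Only the verifier names DOM, OCR, and AX come from the article.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# Verifier names mentioned in the article; treating them as an allow-list is an assumption.
TRUSTED_VERIFIERS: FrozenSet[str] = frozenset({"DOM", "OCR", "AX"})


@dataclass(frozen=True)
class Predicate:
    """One action-critical claim the tool call depends on (hypothetical type)."""
    name: str
    expected: str


@dataclass(frozen=True)
class Certificate:
    """Typed evidence produced by an external verifier (hypothetical type)."""
    verifier: str
    predicate: str
    observed: str


@dataclass(frozen=True)
class ToolCall:
    """A proposed action, decomposed into the predicates it relies on."""
    tool: str
    predicates: Tuple[Predicate, ...]


def authorize(call: ToolCall, certs: List[Certificate]) -> Tuple[bool, List[str]]:
    """Return (allowed, reasons). Model text is never consulted, only certificates."""
    if not call.predicates:
        # Fail closed: a call that declares nothing to verify is not authorized.
        return False, ["no action-critical predicates declared"]

    # Index certificates by predicate, ignoring any from untrusted sources.
    by_predicate: Dict[str, List[Certificate]] = {}
    for cert in certs:
        if cert.verifier in TRUSTED_VERIFIERS:
            by_predicate.setdefault(cert.predicate, []).append(cert)

    reasons: List[str] = []
    for pred in call.predicates:
        matching = by_predicate.get(pred.name, [])
        if not matching:
            reasons.append(f"{pred.name}: no certificate from a trusted verifier")
        elif any(c.observed != pred.expected for c in matching):
            reasons.append(f"{pred.name}: verifier output contradicts the claim")
    return (not reasons), reasons


# Usage: the same proposed click is denied without evidence and allowed with it.
call = ToolCall("click", (Predicate("button_label", "Save draft"),))
denied = authorize(call, [])
allowed = authorize(call, [Certificate("DOM", "button_label", "Save draft")])
```

The denial reasons name the predicate and the kind of failure, which mirrors the article's point that perception errors are exposed through named verifier outputs instead of being hidden.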

Reducing Unauthorized Actions

In testing that included 1,900 targeted attacks, hardening steps reduced the gate bypass rate from 15% to 1.3%. Content-derived certificates maintained a 0% unsafe-action rate over a 200-task pipeline, with an upper bound of 2.67%. In a 120-task browser test, the upper bound was 4.3%. These numbers point to a substantial improvement in AI safety.
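An upper bound accompanies the 0% figure because observing zero unsafe actions in a finite number of tasks does not prove the true rate is zero. The article does not say which interval method or confidence level was used, so the sketch below is only one common choice: the exact one-sided binomial bound for zero observed failures. The function name and the `alpha` parameter are assumptions, and the code does not attempt to reproduce the article's figures.

```python
def zero_failure_upper_bound(n_trials: int, alpha: float) -> float:
    """Upper bound on a failure rate after n_trials trials with zero failures.

    If the true rate were p, the chance of seeing no failures is (1 - p)**n_trials.
    Setting that equal to alpha and solving for p gives the bound returned here.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return 1.0 - alpha ** (1.0 / n_trials)


# Fewer tasks give a looser bound at the same (assumed) confidence level.
bound_200 = zero_failure_upper_bound(200, alpha=0.05)
bound_120 = zero_failure_upper_bound(120, alpha=0.05)
```

This explains the direction of the reported numbers: with fewer tasks, the same clean record supports a weaker guarantee, so the 120-task bound is larger than the 200-task one.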

Implications for AI Safety

Why should developers and industry stakeholders care about this shift? Because far fewer unsupported action claims reached unsafe execution. In a direct audit of 500 stratified task keys, naive agents had a 100% unsafe execution rate, compared with the ECA's clean record. This raises a critical question: can other AI systems afford not to adopt such rigorous verification measures?

While traditional neural judge baselines could be bypassed under similar threat models, ECA's principle stands out: model language can propose actions, but external evidence must authorize them. This approach challenges the status quo and sets a new standard for AI reliability and safety.
