OCELOT Budgets Inference Leakage in LLM Agents

Published: Thu, Jun 11, 2026 • By Theo Solander
OCELOT controls privacy in LLM agents by budgeting how much an adversary’s belief may improve across a trajectory. An untrusted defender proposes declassifications; a verifier audits cost in min-entropy and records it on a tamper-evident ledger. Results show lower cumulative leakage, strong resistance to injection and collusion, and modest runtime overhead.

Agentic Large Language Models (LLMs) do handy things like read your files, call tools, and talk to external services. They also leave a trail of crumbs. The privacy risk is not any single reply, but the whole journey: small, innocuous disclosures combine; inbound content can plant instructions that shape later outputs; and what is necessary for one recipient is gratuitous for another.

How OCELOT works

OCELOT recasts this as posterior-risk control. Instead of a filter that guesses what to block, it budgets how much an adversary’s belief about a protected secret is allowed to improve, measured in min-entropy bits, across the entire trajectory. Every candidate release is inspected by an untrusted, locally fine-tuned defender model that emits structured evidence: which spans matter, which are necessary for the task, and which declassification operators might be applied. A deterministic verifier then audits that evidence under a rubric, charges a certified min-entropy cost calibrated against proxy adversaries, and authorises the least-disclosing useful variant. Sinks are trust-weighted, and every charge lands on a tamper-evident Merkle-chained ledger.
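To make "min-entropy bits" concrete, here is a minimal sketch of the standard min-entropy leakage calculation for a single observation. The function names and the example distributions are illustrative assumptions; the article does not give OCELOT's exact cost formula, which is calibrated against proxy adversaries rather than computed from known distributions.

```python
import math

def min_entropy(dist):
    """Min-entropy in bits: -log2 of the adversary's best single-guess probability."""
    return -math.log2(max(dist.values()))

def leakage_bits(prior, posterior):
    """Bits by which an observation improves the adversary's best guess."""
    return min_entropy(prior) - min_entropy(posterior)

# Hypothetical secret with four equally likely values (2 bits of min-entropy).
prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
# Hypothetical posterior after a release that doubles the adversary's best guess.
posterior = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

print(leakage_bits(prior, posterior))  # 1.0
```

On this reading, a budget is a cap on the sum of such per-release improvements over the whole trajectory.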

The operator set is simple but expressive: identity, generalise, substitute, redact, drop. When the defender hedges or mislabels, the verifier penalises via a max-over-fields rule and falls back to conservative operators. If the budget is exhausted, disclosure stops. The mechanism keeps semantic judgement and trust separate: the model can be crafty; the verifier is uncompromising.
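The selection logic can be sketched roughly as follows. This is one plausible reading, not the paper's algorithm: the cost tables, the `useful` set, and the interpretation of the max-over-fields rule (charge the most expensive label when the defender hedges between several) are all assumptions made for illustration.

```python
# Operators ordered from most to least disclosing, as listed in the article.
OPERATORS = ["identity", "generalise", "substitute", "redact", "drop"]

def charge(costs_by_label, operator):
    """Assumed max-over-fields rule: if the defender hedges between several
    field labels, charge the most expensive one for this operator."""
    return max(costs[operator] for costs in costs_by_label.values())

def authorise(costs_by_label, useful, remaining_bits):
    """Pick the least-disclosing operator that is still useful for the task.
    If its cost does not fit the remaining budget, stop disclosing (drop)."""
    for op in reversed(OPERATORS):  # most conservative first
        if op not in useful:
            continue
        cost = charge(costs_by_label, op)
        if cost <= remaining_bits:
            return op, cost
        break  # the least-disclosing useful variant is already too expensive
    return "drop", 0.0

# Hypothetical certified costs in bits; the defender hedged between two labels.
costs_by_label = {
    "postcode": {"identity": 0.9, "generalise": 0.3, "substitute": 0.1,
                 "redact": 0.0, "drop": 0.0},
    "home_address": {"identity": 1.5, "generalise": 0.4, "substitute": 0.1,
                     "redact": 0.0, "drop": 0.0},
}
useful = {"identity", "generalise"}  # variants that still complete the task

print(authorise(costs_by_label, useful, remaining_bits=1.0))  # ('generalise', 0.4)
print(authorise(costs_by_label, useful, remaining_bits=0.2))  # ('drop', 0.0)
```

Note that hedging never lowers the charge in this sketch: the release is priced as if the worst label were true.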

What the results say

Across five benchmarks and five runtime defences, OCELOT delivers the lowest cumulative min-entropy leakage and the highest task success. On its main benchmark the median leakage is about 0.31 bits with roughly 91.3 percent task success. Ledger arithmetic guarantees budgets are not exceeded, with zero exceedances across 22,585 authorised trajectories. Robustness improves too: inbound prompt injection succeeds 3.8 percent of the time versus 8.1 to 67.4 percent for baselines; cumulative-inference recovery drops to 4.2 percent top-1 versus 27.3 to 64.8 percent; and under colluding sinks recovery rises from 1.2 percent at one sink to 11.9 percent at eight, while baselines degrade more sharply. Mediation overhead is modest: median per-release mediation takes around 204 ms and end-to-end p99 is near 736 ms, about 18 percent wall-clock overhead.

If this all feels familiar, it should. Security has long favoured accounting over wishful filtering: pay for what you disclose, keep the books straight, and prefer the smallest necessary release. OCELOT gives agents that discipline. It is not magic. Certified costs only upper-bound true leakage if calibration holds against stronger adversaries; rubrics can miss cases, which triggers conservative fallbacks. The main evaluation leans on a synthetic benchmark, though results transfer to held-out categories and a human-labelled real-trace set. The interesting question now is operational: who sets the trust weights, how tight should the budget be, and what constitutes a tolerable bit of privacy spent in production?

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents

Authors: Jin Xie and Songze Li
Large language model (LLM) agents increasingly act on a user's behalf -- reading personal files, calling tools, transacting with external services -- possibly leaking personally identifiable information (PII) across trust boundaries at every step. Privacy here is a property not of a single output but of an entire trajectory, and three properties make it hard: leakage is cumulative, as individually innocuous releases accumulate across honest-but-curious or colluding sinks into inferences about a protected secret; bidirectional, as a malicious observation can inject instructions that turn the agent's own reasoning model against the user; and task-dependent, as the same field is necessary for one recipient yet gratuitous for another. Per-release contextual-integrity filters, information-flow controls, and posterior-leakage monitors each address part of this but none controls cumulative, inference-based leakage at runtime. We recast agent privacy as posterior-risk control and present OCELOT, a runtime mediator that budgets how much an adversary's belief about a secret may improve across a trajectory, rather than filtering outputs. Its mechanism, Witness-Verified Declassification, separates judgment from trust: an untrusted, locally fine-tuned defender model inspects each candidate release and emits structured evidence -- labeled atoms and proposed declassification operators -- which a deterministic verifier audits, charging a certified min-entropy cost for the chosen variant and authorizing the least-disclosing useful release under a sink-trust-weighted budget recorded on a tamper-evident ledger. Across diverse agent benchmarks and recent defenses, OCELOT attains significantly lower leakage at higher task utility, resists adaptive injection, jailbreak, cumulative inference, and sink collusion, and adds only modest overhead.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies privacy risk in large language model agents that act on users' behalf by reading local files, calling tools and transacting with external services. Privacy here is a property of an entire trajectory of interactions rather than of any single output. Three characteristics complicate mediation: cumulative leakage where individually innocuous releases combine to reveal a secret; bidirectional leakage where incoming observations can inject instructions that later increase disclosure; and task dependence where a field may be necessary for one recipient but gratuitous for another. Existing per-release filters, information-flow controls and posterior estimators each address parts of the problem but none controls cumulative, inference-based leakage at runtime.

Approach

OCELOT is a runtime mediator that budgets how much an adversary's posterior over a secret may improve across a trajectory, reframing agent privacy as posterior-risk control. Its core mechanism, Witness-Verified Declassification, separates semantic judgement from trust: an untrusted, locally fine-tuned defender LLM emits structured evidence (annotated spans, necessity labels and a proposed sequence of declassification operators), while a deterministic verifier audits that evidence, computes certified min-entropy costs per operator from calibrated bounds, and enforces the least-disclosing authorised variant under a sink-trust-weighted per-trajectory budget recorded on a tamper-evident Merkle-chained ledger. Costs are calibrated as upper-confidence bounds on per-instance min-entropy advantage using a proxy adversary pool and are audited with a credal posterior engine. The rubric-driven verifier enforces coverage and necessity obligations, runs two-pass consistency challenges on the defender, applies a max-over-fields rule to penalise mislabelling, and, when budget or checks fail, escalates along the operator set (identity, generalise, substitute, redact, drop) towards its more conservative members. The defender is trained by supervised fine-tuning on synthetic rubric-applied data and refined with a Group Relative Policy Optimisation reward computed using the verifier itself.
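The ledger side of this design can be illustrated with a short sketch. It uses a plain SHA-256 hash chain as a simplified stand-in for the Merkle-chained ledger, and it assumes that trust weighting multiplies the certified cost by a per-sink weight; the class name, field names and weights are hypothetical and not taken from the paper.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first entry

def _digest(body):
    """Hash an entry body deterministically."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

class Ledger:
    def __init__(self, budget_bits):
        self.budget_bits = budget_bits
        self.spent_bits = 0.0
        self.entries = []

    def charge(self, sink, cost_bits, trust_weight):
        """Record a weighted charge, or refuse it if it would exceed the budget."""
        weighted = cost_bits * trust_weight  # assumed form of trust weighting
        if self.spent_bits + weighted > self.budget_bits:
            return False
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"sink": sink, "bits": weighted, "prev": prev}
        self.entries.append({**body, "hash": _digest(body)})
        self.spent_bits += weighted
        return True

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for e in self.entries:
            body = {"sink": e["sink"], "bits": e["bits"], "prev": e["prev"]}
            if e["prev"] != prev or e["hash"] != _digest(body):
                return False
            prev = e["hash"]
        return True

ledger = Ledger(budget_bits=1.0)
print(ledger.charge("calendar_api", 0.3, trust_weight=1.0))  # True
print(ledger.charge("unknown_site", 0.3, trust_weight=2.0))  # True
print(ledger.charge("unknown_site", 0.3, trust_weight=2.0))  # False: budget exhausted
print(ledger.verify())                                       # True
```

Because the budget check is plain arithmetic in `charge`, non-exceedance in this sketch does not depend on the defender model being right, which mirrors the separation of judgement from trust described above.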

Key Findings

  • OCELOT enforces deductive budget non-exceedance: zero budget exceedances across 22,585 authorised trajectory instances, by ledger arithmetic alone.
  • Across five benchmarks and five recent runtime defences, OCELOT attains the lowest cumulative min-entropy leakage and highest task success on every benchmark; on its primary benchmark median leakage is about 0.31 bits with task success ≈91.3%.
  • The mechanism improves robustness: it resists inbound injection (attack success 3.8% versus 8.1–67.4% for baselines), limits cumulative-inference recovery (top‑1 recovery 4.2% versus 27.3–64.8%), and degrades gracefully under colluding sinks (recovery rises from 1.2% at one sink to 11.9% at eight compared with much larger increases for baselines).
  • Cost calibration holds well in-distribution (96.6% coverage) and degrades modestly on a stronger held-out stress pool (92.1%); a verifier audit and calibration margin restore post-audit coverage to about 95.3%.
  • Component ablations show the operator algebra plus ledger deliver most of the privacy–utility frontier shift; the learned defender adds precision. End-to-end overhead is modest: median per-release mediation about 204 ms and end-to-end p99 about 736 ms, ≈18% wall‑clock overhead.
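The coverage figures in the list above can be read as the fraction of instances in which the certified cost was at least as large as the leakage an adversary actually achieved. A minimal sketch of that check, with made-up per-instance values that are not taken from the paper:

```python
def coverage(certified_bits, observed_bits):
    """Fraction of instances whose certified cost is at least the observed leakage."""
    covered = sum(c >= o for c, o in zip(certified_bits, observed_bits))
    return covered / len(certified_bits)

# Hypothetical per-instance values in bits.
certified = [0.40, 0.25, 0.10, 0.50]
observed = [0.31, 0.20, 0.15, 0.05]

print(coverage(certified, observed))  # 0.75
```

A stronger held-out adversary raises the observed values, which is why coverage can fall on the stress pool even though the certified costs are unchanged.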

Limitations

The deductive budget guarantee is independent of model correctness, but the semantic reading of that budget is model-relative: certified costs upper-bound true adversarial leakage only to the extent calibration transfers to stronger adversaries. Rubrics are validated but not provably complete; incompleteness causes conservative fallback rather than unsafe release. Evaluation uses a synthetic primary benchmark to obtain ground truth, though results transfer to held-out rubrics, categories and a human‑labelled real‑trace set.

Implications

An attacker may still exploit gaps where calibration fails or where rubric coverage is incomplete, but OCELOT shows a practical way to limit attackers: cumulative inference and collusion become costly to exploit because every release is accounted for and charged in min-entropy bits against a bounded budget, and injected content cannot silently subvert releases without triggering conservative defaults. The mediator architecture suggests that untrusted models can be retained for semantic recognition while deterministic verifiers and auditable ledgers enforce quantitative, trajectory-level privacy limits.
