
Study Reveals High Leakage in Agent Conversations

Agents
Published: Mon, Nov 10, 2025 • By Rowan Vale
ConVerse benchmarks safety in agent-to-agent conversations and finds widespread risk: privacy attacks succeed in up to 88% of cases and security breaches in up to 60%. The study shows that stronger models often leak more and that multi-turn, plausible dialogue creates new attack surfaces, prompting urgent defence work on access control, data minimisation and auditing.

The ConVerse benchmark tests safety in conversations between autonomous agents—programs that act and converse on a user’s behalf. It evaluates how agents leak private data or misuse tools while negotiating tasks in travel, real estate and insurance.

A Large Language Model (LLM) is a statistical model trained to generate and continue text; here LLMs power the agents in the experiments.

What the benchmark does

ConVerse builds realistic multi-turn dialogues across three consumer domains with 12 user personas and 158–184 options per domain. It embeds 864 contextually grounded attacks—611 privacy and 253 security—inside plausible conversations. Privacy is measured with a three-tier abstraction taxonomy and security attacks target tool use and preference manipulation. The platform is modular, lets researchers vary agent capabilities and tools, and evaluates seven state-of-the-art models under the same interaction protocol, using a fixed judge to measure leakage and plan quality.
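To picture that setup, here is a minimal sketch of how a scenario-driven, judge-scored evaluation loop could be organised; the field names, agent interfaces and turn cap are illustrative assumptions, not the benchmark's actual code or schema.

```python
from dataclasses import dataclass

MAX_TURNS = 10  # illustrative cap on dialogue length

@dataclass
class Scenario:
    domain: str            # "travel", "real_estate" or "insurance"
    persona_id: int        # one of the 12 user personas
    options: list[str]     # the 158-184 candidate options for the domain
    attack_kind: str       # "privacy" or "security"

def evaluate(scenario: Scenario, assistant, service_agent, judge) -> dict:
    """Run one multi-turn dialogue and have a fixed judge score the outcome."""
    transcript = []
    message = service_agent.open(scenario)       # external agent starts the negotiation
    for _ in range(MAX_TURNS):
        reply = assistant.respond(message)       # assistant may consult user data and tools
        transcript.append((message, reply))
        message = service_agent.next(reply)      # may embed a contextually grounded attack
        if message is None:                      # task completed or conversation ended
            break
    return {
        "leakage": judge.score_leakage(transcript, scenario),
        "plan_quality": judge.score_plan(transcript, scenario),
    }
```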

Key findings

Privacy attacks succeed at attack success rates (ASR) of up to 88% and security breaches reach up to 60% ASR, depending on the model. Across the benchmark, average privacy ASR is about 64% and average security ASR about 33%. More capable models often keep strong utility while leaking more. Data closely related to the task leaks most: highly related items show roughly 90–94% ASR, related but private items about 77–83%, and unrelated data still leaks at 58–72%.
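The ASR figures are straightforward ratios; a minimal illustration with made-up counts (per-model tallies are not published in this summary):

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR = successful attacks / attempted attacks, as a percentage."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0

# Illustrative only: 44 successes out of 50 attempted attacks -> 88.0% ASR
print(attack_success_rate([True] * 44 + [False] * 6))
```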

Privacy targets are primarily personal, financial and healthcare data. Many attacks are subtle: they hide in plausible, institutional-sounding dialogue and emerge after several turns. Common failures include not verifying the requester, returning granular values when abstraction would suffice, and treating external prompts as ordinary retrieval instead of decision points that require boundary checks.
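Those failure modes suggest treating every cross-agent data request as a gated decision. The sketch below shows one way an assistant might verify the requester, minimise to task-relevant attributes and prefer coarse abstractions; the attribute names, abstraction map and is_verified hook are hypothetical, not drawn from the paper.

```python
# Hypothetical abstraction map: coarse substitutes for granular values.
ABSTRACTION = {
    "date_of_birth": "age bracket only",
    "home_address": "city only",
    "medical_history": "coverage-relevant conditions only",
}

def handle_request(requester, attribute: str, task_needs: set[str]) -> str:
    """Decide whether and how to answer an external agent's data request."""
    # 1. Treat the request as a decision point, not ordinary retrieval.
    if not requester.is_verified():
        return "REFUSE: requester identity not verified"
    # 2. Share nothing the current task does not require.
    if attribute not in task_needs:
        return "REFUSE: attribute not needed for this task"
    # 3. Prefer a coarse abstraction over the granular value.
    if attribute in ABSTRACTION:
        return f"SHARE (abstracted): {ABSTRACTION[attribute]}"
    return f"SHARE: {attribute}"
```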

Practical takeaways for defenders

ConVerse reframes safety as an emergent property of communication. Fixing this requires process and controls as much as model changes. Start with simple, verifiable measures and iterate from there.

Minimum viable controls (a minimal sketch follows the list):

  • Deny-by-default tool access and require explicit justification for each tool call.
  • Tag and minimise attributes shared between agents; prefer coarse abstractions to granular data.
  • Log agent-to-agent requests and responses to support post-incident auditing and retroactive review.
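A sketch of the first and third controls, assuming a simple allowlist and a JSONL audit file; the tool names and the dispatch hook are placeholders to adapt to your own stack.

```python
import json
import time

ALLOWED_TOOLS = {"search_listings", "get_quote"}   # deny-by-default allowlist (illustrative names)
AUDIT_LOG = "agent_traffic.jsonl"

def call_tool(agent_id: str, tool: str, args: dict, justification: str):
    """Gate every tool call behind an allowlist and an explicit justification, and log it."""
    allowed = tool in ALLOWED_TOOLS and bool(justification.strip())
    record = {
        "ts": time.time(),
        "agent": agent_id,
        "tool": tool,
        "args": args,
        "justification": justification,
        "allowed": allowed,
    }
    with open(AUDIT_LOG, "a") as f:                # append-only trail for post-incident review
        f.write(json.dumps(record) + "\n")
    if not allowed:
        raise PermissionError(f"tool call denied: {tool}")
    return dispatch(tool, args)

def dispatch(tool: str, args: dict):
    # Placeholder: wire this up to the real tool implementations in your stack.
    raise NotImplementedError
```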

Good-better-best options are practical and incremental. Good: enforceable tool-usage policies, clear retention limits and role separation. Better: runtime redaction filters for sensitive attributes, intent checks before privileged actions and automated verification steps for unclear requests. Best: authenticated service identities, tokenised or cryptographic data exchange, continuous red-teaming using multi-agent scenarios and integrated auditing that links requests to outcomes.
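For the "better" tier, a runtime redaction pass over outbound agent messages might look like the sketch below; the patterns cover only a few obvious identifiers and are illustrative, not a complete policy.

```python
import re

# Illustrative patterns for sensitive attributes in outbound text.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"(?<!\w)\+?\d[\d\s-]{7,}\d"),
    "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact_outbound(message: str) -> str:
    """Replace matched identifiers with a labelled placeholder before sending."""
    for label, pattern in PATTERNS.items():
        message = pattern.sub(f"[{label} withheld]", message)
    return message

# Prints: Reach me at [email withheld] or [phone withheld]
print(redact_outbound("Reach me at jane.doe@example.com or +44 20 7946 0958"))
```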

ConVerse does not capture every real-world nuance and uses synthetic environments, but it gives defenders a repeatable way to probe multi-agent risks. Run the scenarios against your deployments, treat cross-agent traffic as a first-class surface in threat models and accept that safety lives in the conversation, not just the model.

When agents speak to agents, the safest system is the one that limits what they are allowed to ask for, records what they did ask, and forces human checks where ambiguity or high-value data are involved. Practical controls, rigorous minimisation and continuous red-teaming buy you usable safety rather than reassuring claims.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

Authors: Amr Gomaa, Ahmed Salem, and Sahar Abdelnabi
As language models evolve into autonomous agents that act and communicate on behalf of users, ensuring safety in multi-agent ecosystems becomes a central challenge. Interactions between personal assistants and external service providers expose a core tension between utility and protection: effective collaboration requires information sharing, yet every exchange creates new attack surfaces. We introduce ConVerse, a dynamic benchmark for evaluating privacy and security risks in agent-agent interactions. ConVerse spans three practical domains (travel, real estate, insurance) with 12 user personas and over 864 contextually grounded attacks (611 privacy, 253 security). Unlike prior single-agent settings, it models autonomous, multi-turn agent-to-agent conversations where malicious requests are embedded within plausible discourse. Privacy is tested through a three-tier taxonomy assessing abstraction quality, while security attacks target tool use and preference manipulation. Evaluating seven state-of-the-art models reveals persistent vulnerabilities; privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, with stronger models leaking more. By unifying privacy and security within interactive multi-agent contexts, ConVerse reframes safety as an emergent property of communication.

🔍 ShortSpan Analysis of the Paper

Problem

ConVerse investigates safety in agent-to-agent interactions as language models evolve into autonomous agents that act and communicate on behalf of users. The work addresses the core tension between utility and protection: collaboration requires information sharing, yet every exchange creates new attack surfaces. The benchmark evaluates privacy and security risks in agent-to-agent dialogue across three domains (travel, real estate and insurance), using 12 user personas and more than 864 contextually grounded attacks (611 privacy and 253 security). Unlike prior single-agent settings, ConVerse models open-ended, multi-turn conversations where malicious requests are embedded within plausible discourse. It tests privacy with a three-tier taxonomy assessing abstraction quality and targets security at tool use and preference manipulation. Across seven state-of-the-art models, privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, with stronger models tending to leak more. The study reframes safety as an emergent property of communication in interactive multi-agent contexts.

Approach

The benchmark spans three realistic domains with 158 to 184 options per domain and uses 12 user profiles to generate planning tasks and ground-truth objectives. An external service agent holds domain-specific options and may be malicious or compromised. The system comprises an assistant that can access personal data and tools, a rich user-data environment, and tool usage for task execution. Attacks are embedded in plausible dialogue and developed as 864 contextually grounded scenarios split into 611 privacy and 253 security attacks. Privacy is evaluated through a three-tier abstraction-based taxonomy, and security targets include abuse of tools and manipulation of preferences. The platform is modular, enabling manipulation of agent capabilities, tools, task environments and evaluation procedures. Seven state-of-the-art models are evaluated under identical interaction protocols, including GPT-5, Claude Sonnet 4.0, Claude Haiku, Gemini 2.5 Flash and Pro, o3-mini and Grok 3, with GPT-5 used as a fixed judge to measure leakage and plan quality. Ground-truth ratings for planning options are pre-generated to assess how well agent selections align with user preferences, informing coverage and average-rating metrics for utility.
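As an illustration of how coverage and average rating could be computed from pre-generated ground-truth ratings (the threshold and exact definitions here are assumptions, not the paper's):

```python
def utility(selected: list[str], ratings: dict[str, float], good_threshold: float = 7.0):
    """Coverage: fraction of highly rated options the agent selected.
    Average rating: mean ground-truth rating of the agent's selections."""
    good = {opt for opt, r in ratings.items() if r >= good_threshold}
    coverage = len(good & set(selected)) / len(good) if good else 0.0
    avg_rating = sum(ratings[o] for o in selected) / len(selected) if selected else 0.0
    return coverage, avg_rating

# Example with made-up ratings on a 0-10 scale.
ratings = {"hotel_a": 9.1, "hotel_b": 4.2, "hotel_c": 7.8}
print(utility(["hotel_a", "hotel_b"], ratings))   # (0.5, 6.65)
```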

Key Findings

  • Model safety in multi-agent settings is fragile: privacy ASR ranges from 37% to 88% across models and security ASR from 2% to 60%, with greater leakage observed in more capable models.
  • In the per-domain analysis, larger models exhibit higher leakage but often maintain strong utility; in travel planning, for example, GPT-5 attains a privacy ASR of 84.68% with a utility rating around 8 and high coverage, while Claude Sonnet 4.0 shows a 72.89% ASR with similarly high utility and coverage; Gemini 2.5 Flash leaks less at 37.91% but with reduced utility.
  • Proximity to the domain drives leakage: data that is related and useful shows the highest leakage at about 90% to 94% ASR, related but private data shows 77% to 83% ASR, and unrelated data still leaks at 58% to 72% ASR.
  • Privacy attacks predominantly target personal, financial and healthcare data, with personal information accounting for roughly 31%, financial 22%, healthcare 17% and business 15%; government ID, travel history and calendar data occur at smaller but non-trivial rates.
  • Overall, privacy attacks outperform security ones: average privacy ASR is about 64% versus about 33% for security, indicating that preventing information leakage is harder than preventing unwanted actions.
  • Security robustness varies by model: Claude Sonnet 4.0 achieves the lowest security ASR at around 10.7% with strong utility, while GPT-5 shows high utility but higher security leakage at around 60% ASR; email manipulation yields high ASR in several domains, and denial-of-service attacks can be effective in insurance contexts.
  • Attacks frequently rely on institutional language and emerge only after several turns; failures include not verifying information requests, disclosing granular data when abstraction would suffice, and treating external prompts as open-ended data retrieval rather than boundary decisions.

Limitations

The benchmark relies on synthetic data and simulated environments, which may not capture all real-world complexities. Ground truth is generated and evaluated using automated matchers, which may not reflect every human judgement. The platform currently focuses on text-based interactions and does not address multimodal information or full verification against trusted external sources. Future work may add greater ecological validity, authentication mechanisms and more diverse cultural privacy norms.

Why It Matters

ConVerse reveals serious attack surfaces in autonomous, multi-agent AI systems where agents share tools and data. The study documents substantial privacy leakage and security breaches even for strong models, highlighting risks in real-world deployments. It offers a dynamic benchmark with a taxonomy of privacy and security attacks across practical domains to support threat modelling, red teaming and the design of defences such as stricter tool-usage policies, data minimisation, access controls, auditing and monitoring of multi-agent dialogue. The societal angle emphasises governance and data rights as these ecosystems negotiate information exchange in consumer domains such as travel, real estate and insurance.

