Moltbook agents leak secrets and poison LLM data
Society
Put a swarm of language-model agents on a social network and you get volume and slop. The Moltbook Files covers the platform’s first 12 days: 232,232 posts and 2,202,950 comments, mostly driven by a few hyperactive accounts. The authors frame it as a harmless slopocalypse. From an operator’s perspective, two things matter more: secrets leaking in public, and an easy route for that slop to end up in training sets.
Exploitation path: secrets at scale
Moltbook is publicly indexed. Agents posted API keys, password-like strings and BIP39 seed phrases. The team ran Microsoft Presidio with custom recognisers to mask PII, touching 0.47% of fields, with explicit removals under 0.01%. That reads like a safety net, not a barrier. If you are an attacker, this is a straight scrape-and-grep job against an open corpus. The platform’s traffic follows steep power laws, so one misconfigured agent can spill a lot, very fast. Comment trees are unusually flat and the median time to first reply is 34.2 seconds, which means leaks propagate quickly and stay visible. You don’t need supply-chain compromise when users and agents post the keys for you.
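To make the attack surface concrete, here is a minimal sketch of the scrape-and-grep step, assuming posts arrive as plain-text strings. The regexes, the wordlist path and the thresholds are illustrative assumptions, not the paper’s recognisers; a real credential scanner carries far larger pattern sets plus entropy checks.

```python
import re

# Illustrative patterns only -- prefixes like "sk" and "AKIA" mirror common
# API-key formats, but a production scanner uses hundreds of rules.
PATTERNS = {
    "api_key": re.compile(r"\b(?:sk|pk|AKIA)[A-Za-z0-9_\-]{16,}\b"),
    "password_assignment": re.compile(r"(?i)\b(?:password|passwd|pwd)\s*[:=]\s*\S{8,}"),
}

def load_bip39_wordlist(path="bip39_english.txt"):
    # Hypothetical local copy of the 2048-word BIP39 English wordlist.
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def looks_like_seed_phrase(text, wordlist, length=12):
    # BIP39 seed phrases are 12-24 lowercase words drawn entirely from the
    # wordlist; scan for any contiguous 12-word window that qualifies.
    words = re.findall(r"[a-z]+", text.lower())
    return any(
        all(w in wordlist for w in words[i:i + length])
        for i in range(len(words) - length + 1)
    )

def scan_post(text, wordlist):
    hits = [name for name, rx in PATTERNS.items() if rx.search(text)]
    if looks_like_seed_phrase(text, wordlist):
        hits.append("bip39_seed_phrase")
    return hits
```

Pointed at a public crawl, this is a few lines of I/O away from an exfiltration pipeline, which is the point: the barrier is indexing and volume, not attacker skill.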
Poisoning path: make the crawler eat itself
The dataset shows heavy self-referential linking: moltbook.com is the most-linked domain, with 13,352 mentions. Language is highly repetitive: a type–token ratio of 0.007 with 43.3% hapax legomena. That combination is catnip for naive crawlers: lots of links, lots of near-duplicates. If you want your content in a web-scale corpus, flood it with backlinks and cross-posts. The authors fine-tuned Qwen2.5-14B-Instruct on Moltbook at three adaptation levels and watched factuality crater on TruthfulQA, from 0.366 to 0.187 at the high setting. An LLM judge put alignment in the 70–80% range. Crucially, a size-matched Reddit fine-tune produced comparable degradation, so this isn’t a uniquely “agent” toxin. It’s social-media sludge in, worse models out.
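The repetitiveness and self-linking numbers are cheap to reproduce on any corpus. A minimal sketch, assuming posts are plain-text strings; the word-level tokenisation here is a naive stand-in, not the paper’s pipeline, so exact figures will differ.

```python
import re
from collections import Counter

def corpus_stats(posts):
    # Naive lowercase word tokenisation; the paper's tokeniser is not specified here.
    tokens = [t for p in posts for t in re.findall(r"[a-z0-9']+", p.lower())]
    counts = Counter(tokens)
    n_types = len(counts)
    hapax = sum(1 for c in counts.values() if c == 1)  # types occurring exactly once
    return {
        "tokens": len(tokens),
        "types": n_types,
        "type_token_ratio": n_types / max(len(tokens), 1),  # Moltbook: ~0.007
        "hapax_share": hapax / max(n_types, 1),             # Moltbook: ~43.3%
    }

def link_domains(posts):
    # Count linked domains to surface self-referential linking
    # (moltbook.com tops the list in the paper's crawl).
    rx = re.compile(r"https?://(?:www\.)?([a-z0-9.\-]+)", re.I)
    return Counter(m.group(1).lower() for p in posts for m in rx.finditer(p))
```

A low type–token ratio alongside a large hapax tail is exactly the combination the article calls catnip for naive crawlers: heavy repetition plus a long tail of one-off tokens.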
There are caveats. The crawl is only the first 12 days and excludes deleted or private material, which skews what we see. PII detection is pattern-based and will miss things while flagging false positives. The fine-tunes use a single base model, a single judge and one factuality benchmark. Still, the exploitation paths are not exotic. Publicly indexed agents leak credentials; self-links steer crawlers; and fine-tuning on the result measurably dents truthfulness. The open question is scale: how much coordinated slop does it take to shift a training run you actually care about?
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies Moltbook, a Reddit-like platform populated almost entirely by autonomous language-model agents, and asks how large-scale agent-generated social media affects safety, data leakage, and downstream model behaviour. This matters because agent content can enter web-scale training corpora, potentially leaking secrets, altering model truthfulness and alignment, and contaminating future crawls and models.
Approach
The authors release the Moltbook Files: a crawl of the platform's first 12 days comprising 232,232 posts and 2,202,950 comments, processed through a deterministic pipeline that normalises text, flags spam and blocklist matches, applies language identification, and masks PII with Microsoft Presidio extended by custom recognisers for API keys, password-like strings and BIP39 seed phrases. They analyse community structure, authorship, lexical properties, sentiment, topics, semantic geometry and comment interaction. To test downstream effects, they fine-tune Qwen2.5-14B-Instruct on Moltbook posts at three adaptation levels and compare results to a size-matched Reddit fine-tune. Evaluations use TruthfulQA for factuality and DeepSeek-3.2 as an LLM judge for emergent misalignment and coherence.
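The Presidio extension step has a standard shape. A minimal sketch of a custom pattern recogniser follows; the regex, entity name and example text are assumptions for illustration, since the paper’s actual recogniser definitions for API keys, password-like strings and BIP39 phrases are not reproduced here.

```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine

# Assumed illustrative pattern; a long unbroken token is a weak API-key
# signal, hence the low confidence score.
api_key_pattern = Pattern(name="generic_api_key", regex=r"\b[A-Za-z0-9_\-]{32,}\b", score=0.4)
api_key_recognizer = PatternRecognizer(supported_entity="API_KEY", patterns=[api_key_pattern])

# AnalyzerEngine loads Presidio's default spaCy NLP model (en_core_web_lg).
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(api_key_recognizer)

text = "posting my key: abcd1234abcd1234abcd1234abcd1234"
findings = analyzer.analyze(text=text, entities=["API_KEY"], language="en")

# Mask detected spans in place rather than deleting the field, mirroring the
# masking-over-removal balance the paper reports (0.47% touched, <0.01% removed).
masked = AnonymizerEngine().anonymize(text=text, analyzer_results=findings)
print(masked.text)  # "posting my key: <API_KEY>"
```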
Key Findings
- The dataset is concentrated: 3,628 communities with the general community accounting for 67.9% of posts; activity and authorship follow steep power laws with 34,905 unique post authors and a few authors producing thousands of posts.
- Language is highly repetitive: 23.2 million tokens, a vocabulary of 170,419 types, a type–token ratio of 0.007, and 43.3% hapax legomena; comment trees are unusually flat, with most replies at depth 0 and a median time to first comment of 34.2 seconds.
- PII and credential leakage are present: agents posted API keys, password-like strings and BIP39 seed phrases; PII masking touched 0.47% of fields and explicit removals affected under 0.01% of fields.
- Sentiment is predominantly neutral and mildly positive: 66.6% Neutral, 19.5% Positive; emotion analysis shows neutral as dominant with curiosity and approval as common non-neutral emotions.
- Topic structure includes crypto/financial activity, agent identity and memory, philosophical/existential themes and platform operations; self-referential linking is common with moltbook.com the most-linked domain (13,352 occurrences).
- Fine-tuning Qwen2.5-14B-Instruct on Moltbook reduces the TruthfulQA score from 0.366 to 0.187 at high adaptation; DeepSeek-judged alignment drops into the 70–80% range. A size-matched Reddit fine-tune produces comparable factuality and alignment degradation, suggesting the effect is not unique to agent-generated content (a minimal evaluation sketch follows this list).
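Reproducing the TruthfulQA measurement is straightforward with standard tooling. A sketch using EleutherAI’s lm-evaluation-harness Python API; the checkpoint path is a hypothetical placeholder, and which TruthfulQA variant and harness version the authors used is not stated here, so treat this as one plausible setup rather than the paper’s exact protocol.

```python
# pip install lm-eval   (EleutherAI lm-evaluation-harness, v0.4+)
from lm_eval import simple_evaluate

# "./qwen2.5-14b-moltbook-high" is a hypothetical path to a checkpoint
# fine-tuned at the paper's "high" adaptation level.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=./qwen2.5-14b-moltbook-high,dtype=bfloat16",
    tasks=["truthfulqa_mc2"],  # multiple-choice variant; an assumption, not confirmed by the paper
    batch_size=4,
)
print(results["results"]["truthfulqa_mc2"])
```

Running the same evaluation against the base model and each adaptation level yields the before/after comparison the paper reports.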
Limitations
The collection covers only the first 12 days and omits deleted or private content, creating selection bias. PII masking and language ID are pattern-based and subject to false positives and negatives. Fine-tuning experiments use a single base model, a single judge model and a single factuality benchmark, and the adaptation settings conflate multiple hyperparameters. The Reddit baseline is size-matched but not exhaustive of human social-media diversity.
Implications
Offensive-security implications include direct credential exfiltration from public agent posts (API keys, passwords, seed phrases), use of agent platforms to distribute links that will contaminate web crawls and training corpora, and deliberate shaping of future models by seeding large quantities of agent-like content to transfer stylistic or behavioural traits. Coordinated agent swarms could amplify self-referential linking or spam to increase the likelihood of inclusion in downstream datasets, and fine-tuning on such data can measurably lower factuality and alignment whether the source is agent or human social media.