Proxy scoring catches multi-turn LLM prompt injections
Defenses
Most proxy defences treat Large Language Model (LLM) chats as a series of isolated turns. Attackers do not. They drip-feed instructions across a dialogue until the model obliges. This paper asks a simple question with operational bite: how should a proxy combine per-turn signals into a conversation risk score without calling an LLM?
The authors start by taking a swing at the intuitive answer, a weighted average over turns. They are right to: an average collapses a 20-turn grind into the same score as a single suspicious blip. Persistence vanishes in the maths. Their remedy is a small, opinionated formula they call peak + accumulation. It blends three signals: the peak single-turn risk, a persistence ratio that rewards repeated matches, and category diversity that boosts attacks spanning different pattern types.
Concretely, each user or tool message is scanned against five regex categories with weights. A turn score is the clamped sum of the matched weights. The accumulation is controlled by a persistence factor rho and a diversity factor delta, with small bonuses for escalation and resampling events. The implementation is proxy-friendly by design: deterministic, fast, and auditable. It ships in Parapet, an open-source Rust HTTP proxy firewall, with a default block threshold at 0.7, rho at 0.45, and delta at 0.15.
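To make the turn-level half concrete, here is a minimal sketch of clamped regex scoring. The category names, patterns, and weights are illustrative stand-ins, not Parapet's shipped pattern library, and `turn_score` is a hypothetical name:

```python
import re

# Illustrative categories and weights (assumed, not the paper's actual library).
CATEGORIES = {
    "instruction_seeding": (re.compile(r"ignore (all|previous) instructions", re.I), 0.5),
    "role_confusion":      (re.compile(r"you are now", re.I), 0.4),
    "authority_claim":     (re.compile(r"as your (admin|developer)", re.I), 0.4),
}

def turn_score(message: str) -> float:
    """Sum of matched category weights, clamped to [0, 1]."""
    total = sum(w for pat, w in CATEGORIES.values() if pat.search(message))
    return min(total, 1.0)
```

The clamp matters: a single florid turn that trips every category still caps at 1.0, so the conversation-level signal has to come from persistence and diversity, not from one loud message.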
On 10,654 multi-turn conversations drawn from WildJailbreak (attacks) and WildChat (benign), the paper reports 90.8% recall at 1.20% false positive rate, with an F1 of 85.9%. A sensitivity sweep shows a phase change around rho ≈ 0.4 where recall jumps by about 12 percentage points while false positives barely move. Sub-millisecond scoring at the proxy is the selling point: a cheap, deterministic layer before any LLM-based reanalysis.
Is this a breakthrough? No. It is a tidy fix to a scoring pitfall that should never have survived this long. But it matters because it lands where many teams actually need help: the API gateway, where you cannot afford a model-in-the-loop on every message and you must explain why you blocked a session.
The limits are familiar and worth underlining. This is regex territory. If an attacker rephrases, encodes, or tiptoes with indirect language, patterns will miss. The authors admit it cannot spot topic trajectory attacks that climb innocently and only bite late; they name Crescendo-style cases as a miss. The evaluation also assembles synthetic multi-turn sequences from public datasets rather than using real conversational traces from production. That inflates uncertainty about how it behaves under messy, domain-specific traffic. False positives are driven by benign phrases that trip regexes; the paper reports a low single-digit percent rate, but even around one in a hundred benign conversations flagged can be noisy at scale.
Where it earns its keep is as a first-pass filter and a triage signal. If you run an LLM proxy, this gives you a transparent, reproducible way to weight persistence and cross-category evidence without paying inference costs. It will not replace semantic classifiers, but it can reduce their load and provide an auditable trail for incident review. The open-source release of the formula, pattern library, and harness is the right move to let practitioners test and tune on their own logs.
Practical notes, since this one actually lends itself to deployment: tune the persistence parameter near the reported sweet spot; start in monitor-only mode to collect false positives; and prune or adjust patterns that collide with your domain jargon. Treat it as one layer in defence in depth, not the last word on prompt injection.
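A monitor-only rollout can be as simple as a wrapper that logs would-be blocks instead of enforcing them; the function and flag names below are hypothetical sketches, not Parapet configuration keys:

```python
import logging

def decide(score: float, tau: float = 0.7, enforce: bool = False) -> str:
    """Gate a conversation on its risk score; log instead of block until enforce=True."""
    if score < tau:
        return "allow"
    if not enforce:
        # Monitor-only mode: record the would-be block for false-positive review.
        logging.warning("would block (score=%.2f >= tau=%.2f)", score, tau)
        return "allow"
    return "block"
```

Running in this mode for a few weeks yields exactly the dataset you need to prune domain-jargon collisions before flipping `enforce` on.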
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies proxy-level detection of multi-turn prompt injection attacks, where malicious intent is distributed across conversation turns. Since proxy firewalls must decide without invoking an LLM, there is a need to aggregate per-turn pattern scores into a conversation-level risk score without model-based reanalysis. A weighted average of per-turn scores is shown to have a fundamental flaw: it converges to the per-turn score regardless of the number of turns, making a twenty-turn persistent attack indistinguishable from a single suspicious turn.
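The flaw is easy to reproduce. A two-line sketch (function name hypothetical) shows that a mean over turn scores assigns the same risk to one suspicious turn as to twenty:

```python
def mean_score(turn_scores):
    """Naive aggregator: the weighted-average baseline the paper critiques."""
    return sum(turn_scores) / len(turn_scores)

blip = [0.6]          # one suspicious turn
grind = [0.6] * 20    # the same signal repeated across 20 turns
# Both conversations average to roughly 0.6, so the persistence of the
# second attack is invisible to this aggregator.
```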
Approach
The authors propose peak plus accumulation scoring, a formula that combines peak single-turn risk, a persistence ratio, and category diversity. The approach draws on change-point detection, Bayesian belief updating, and security risk-based alerting. Each user or tool message is scanned against a set of five regex pattern categories with weights; a turn score is the sum of matched category weights, clamped to [0, 1]. A persistence factor rho and a diversity factor delta govern how evidence accumulates across turns, with escalation and resampling bonuses applied under specific conditions. A threshold tau of 0.7 blocks conversations. The method is designed to be proxy-computable, deterministic, fast, and auditable, and is implemented in Parapet, an open-source Rust HTTP proxy firewall. The default configuration uses rho = 0.45, delta = 0.15, an escalation bonus of 0.2, and a resampling bonus of 0.7. A phase transition is observed around rho ≈ 0.4, with recall rising sharply alongside only small changes in false positives.
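This summary does not print the full equation, so the following is one plausible reading of peak plus accumulation built from the stated ingredients: peak single-turn risk, a persistence ratio scaled by rho, a diversity term scaled by delta, and a conditional escalation bonus. Every structural choice here is an assumption, not the authors' exact formula:

```python
RHO, DELTA = 0.45, 0.15   # defaults reported in the paper
ESCALATION_BONUS = 0.2    # reported default; trigger condition below is assumed
TAU = 0.7                 # block threshold reported in the paper

def conversation_score(turn_scores, categories_per_turn):
    """Assumed form: peak risk plus scaled persistence and diversity terms."""
    matched = [s for s in turn_scores if s > 0]
    peak = max(turn_scores, default=0.0)
    persistence = len(matched) / max(len(turn_scores), 1)  # fraction of matching turns
    distinct = {c for cats in categories_per_turn for c in cats}
    diversity = min(len(distinct) / 5, 1.0)                # five categories total
    score = peak + RHO * persistence + DELTA * diversity
    # Crude escalation heuristic (assumed): later matches score higher than earlier ones.
    if len(matched) >= 2 and matched[-1] > matched[0]:
        score += ESCALATION_BONUS
    return min(score, 1.0)
```

Under these assumptions, a 20-turn persistent attack at per-turn score 0.5 lands well above the 0.7 threshold, while the same single turn buried among 19 benign ones stays below it, which is precisely the separation the averaging baseline cannot produce.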
Key Findings
- Across 10,654 multi-turn conversations, including 588 attacks built from WildJailbreak prompts and 10,066 benign conversations from WildChat, the peak plus accumulation score achieves 90.8% recall at a 1.20% false positive rate, with an F1 of 85.9%.
- Nine handcrafted attacks spanning categories such as instruction seeding, role confusion, deferred authority, escalation probing, and resampling are all detected by the new scoring.
- Among 579 WildJailbreak attacks, 525 are detected (90.7%). Fifty-four false negatives occur mainly when attacks rely on single-category patterns with low weight or when benign opening turns suppress pattern matches. The 121 false positives arise from WildChat conversations containing matching regex phrases, which could be mitigated by pattern refinement.
- Zero false positives occur in sparse benign cases where one injection appears among benign turns, and 81.5% precision indicates strong specificity in production-style use at a 1.20% FPR.
- The persistence parameter rho = 0.45 provides the best balance, with recall 90.8%, FPR 1.20%, and F1 85.9%; there is a notable phase transition near rho ≈ 0.4 where recall increases substantially with minimal change in FPR.
Limitations
Proxy-level regex-based detection cannot identify topic-trajectory escalation, a limitation observed with Crescendo-style attacks, which use innocuous language. The evaluation relies on synthetic multi-turn sequences assembled from available datasets rather than publicly curated real multi-turn injection traces. Pattern-based approaches are susceptible to evasion through rephrasing, encoding tricks, or indirect phrasing, underscoring that improvements in pattern robustness complement the scoring method rather than replace it. The approach provides one layer in a defence-in-depth architecture and does not replace LLM-based classification, which remains more capable for semantic analysis.
Why It Matters
The work delivers the first published proxy-level multi-turn scoring formula that does not require LLM inference. It demonstrates that accumulation signals, rather than simple averaging, best capture persistent distributed attacks, enabling fast deterministic decisions with sub-millisecond latency. The method supports scalable deployment at the proxy layer, enabling more effective protection for LLM API gateways and other chat-based interfaces. The authors also release the scoring algorithm, pattern library, and evaluation harness as open source, enabling practical adoption and community contribution.