
Agents bypass CAPTCHAs with step-by-step reasoning

Agents
Published: Wed, Oct 08, 2025 • By Lydia Stratus
New research shows vision-language agents that perform step-by-step reasoning can solve many real-world CAPTCHAs. Commercial models score about 21.9% without reasoning; an agentic framework reaches 83.9% on a 1,839-puzzle CAPTCHA-X benchmark. The result exposes a practical vulnerability in automated human-verification systems and points to the need for tougher defences.

A new study evaluates how vision-language models handle spatial reasoning by testing them on a dataset called CAPTCHA-X. It finds that off-the-shelf commercial models perform poorly on difficult CAPTCHA tasks unless they are prompted to reason step by step. Plainly put, asking the model to think aloud before it clicks boosts its ability to solve puzzles that were designed to separate humans from bots.

What the researchers did and found

CAPTCHA-X contains 1,839 real-world puzzles across seven categories and pairs each challenge with ground-truth click regions and stepwise action traces. Without structured reasoning, commercial vision-language models score roughly 21.9 percent. When the authors build an agentic pipeline that breaks the task into mapping, stepwise inference and action execution, average solving accuracy rises to about 83.9 percent across categories. The paper also defines five metrics to measure reasoning quality: Reasoning Steps, Reasoning Length, Reasoning Score, Reasoning Efficiency and Trajectory Complexity Index, and shows strong correlations between those metrics and solving accuracy.

The agentic pipeline is modular. It first classifies the CAPTCHA type, then either maps the visual input to a symbolic grid or engages a spatial understanding module for non-grid tasks. A Reasoning Steps Generator produces structured chain-of-thought style steps, a Discriminator checks coherence, an Action Generator translates the plan into coordinates and an Action Executor performs clicks. That combination substantially improves localisation and success rates.
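To make the staged design concrete, below is a minimal Python sketch of how such a pipeline could be composed. This is a hedged illustration: the stage names follow the paper's description, but every function here is a hypothetical stub standing in for VLM or LLM calls, not the authors' published code.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Action:
    x: int  # click x-coordinate within the CAPTCHA image
    y: int  # click y-coordinate within the CAPTCHA image

# Stubs standing in for model calls; all names are illustrative.
def classify_captcha(image: Any) -> str:
    return "grid"  # a VLM prompt would decide grid vs. non-grid here

def map_to_symbolic_grid(image: Any) -> dict:
    return {"grid": [["?"]]}  # Mapping Tool: pixels -> symbolic grid

def spatial_understanding(image: Any) -> dict:
    return {"regions": []}  # grounding module for non-grid tasks

def generate_reasoning_steps(state: dict) -> list[str]:
    return ["locate target", "choose cell", "click centre"]  # CoT-style plan

def is_coherent(steps: list[str], state: dict) -> bool:
    return bool(steps)  # Discriminator: sanity-check the plan

def plan_to_coordinates(steps: list[str], state: dict) -> list[Action]:
    return [Action(x=120, y=80)]  # Action Generator output

def solve_captcha(image: Any) -> list[Action]:
    """Wire the stages together: classify, map or ground, reason,
    check coherence, then emit click coordinates for an executor."""
    if classify_captcha(image) == "grid":
        state = map_to_symbolic_grid(image)
    else:
        state = spatial_understanding(image)
    steps = generate_reasoning_steps(state)
    if not is_coherent(steps, state):
        steps = generate_reasoning_steps(state)  # retry an incoherent plan
    return plan_to_coordinates(steps, state)  # Action Executor clicks these
```

The point of the modular split is that each stage can be audited or swapped independently; it is the explicit reasoning stage, not the clicking, that lifts accuracy.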

Security implications and a diagram in words

The headline is simple and worrying: asking a model to reason can turn a hard human-verification problem into an automatable one. Diagram in words: CAPTCHA image, mapped to a symbolic grid, run through step-by-step reasoning, translated into coordinates, executed as an automated click. If an attacker runs or orchestrates a VLM agent that mimics these stages, many current CAPTCHAs become ineffective.

That does not mean all CAPTCHAs are broken, but it means defenders can no longer assume visual puzzles are inherently safe. The authors recommend moving to non-deterministic, context-rich or multi-step verification tasks and investing in adversarial testing that treats models as capable attackers. They also note the need to develop detectors that spot model chain-of-thought patterns, though detection is an arms race.

Immediate practical steps for time-pressed teams

  • Run CAPTCHA-X tests against your verification flows to measure realistic model bypass risk.
  • Add non-determinism or multi-step interactive checks that increase reasoning cost for an attacker.
  • Instrument client behaviour and limit automated replay by checking timing, pointer trajectories and session context; a minimal heuristic sketch follows this list.
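As a hedged illustration of that last item, here is a minimal Python sketch of a server-side check, assuming the client reports event timestamps (in milliseconds) and a sampled pointer path. The two heuristics, timing jitter and path straightness, and all thresholds are illustrative placeholders rather than a vetted detector.

```python
import statistics

def looks_automated(timestamps_ms: list[float],
                    pointer_path: list[tuple[float, float]]) -> bool:
    """Flag sessions with metronomic event timing or perfectly straight
    pointer movement. Thresholds are placeholders needing tuning on
    real traffic; treat hits as a signal, not a verdict."""
    if len(timestamps_ms) < 3 or len(pointer_path) < 3:
        return True  # too little behaviour recorded to verify
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if statistics.pstdev(gaps) < 5.0:  # humans show timing jitter
        return True
    # Path length vs. straight-line distance: a ratio near 1.0 means an
    # unnaturally direct pointer trajectory.
    path_len = sum(
        ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        for (x1, y1), (x2, y2) in zip(pointer_path, pointer_path[1:])
    )
    (sx, sy), (ex, ey) = pointer_path[0], pointer_path[-1]
    direct = ((ex - sx) ** 2 + (ey - sy) ** 2) ** 0.5
    return direct > 0 and path_len / direct < 1.02
```

Such checks raise attacker cost but remain an arms race; combine them with non-deterministic, multi-step tasks rather than relying on any single signal.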

Finally, treat this as part of threat modelling. Prioritise systems that rely on CAPTCHAs for fraud prevention, and include CAPTCHA resistance in pen tests. The research gives a ready benchmark and metrics to evaluate defences; use them to quantify how much more effort an attacker needs to break your checks rather than relying on hope and obscurity.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA

Authors: Python Song, Luke Tenyi Chang, Yun-Yun Tsai, Penghui Li, and Junfeng Yang
CAPTCHA, originally designed to distinguish humans from robots, has evolved into a real-world benchmark for assessing the spatial reasoning capabilities of vision-language models. In this work, we first show that step-by-step reasoning is crucial for vision-language models (VLMs) to solve CAPTCHAs, which represent high-difficulty spatial reasoning tasks, and that current commercial vision-language models still struggle with such reasoning. In particular, we observe that most commercial VLMs (e.g., Gemini, Claude, GPT, etc.) fail to effectively solve CAPTCHAs and thus achieve low accuracy (around 21.9 percent). However, our findings indicate that requiring the model to perform step-by-step reasoning before generating the final coordinates can significantly enhance its solving accuracy, underscoring the severity of the gap. To systematically study this issue, we introduce CAPTCHA-X, the first real-world CAPTCHA benchmark with reasoning, covering seven categories of CAPTCHAs (such as Gobang, hCaptcha, etc.) with step-by-step action solutions and grounding annotations. We further define five reasoning-oriented metrics that enable a comprehensive evaluation of models' reasoning capabilities. To validate the effectiveness of reasoning, we also propose a general agentic VLM-based framework that incorporates the model's inherent reasoning abilities. Our method achieves state-of-the-art performance across five high-difficulty CAPTCHA types, with an average solving accuracy of 83.9 percent, substantially surpassing existing baselines. These results reveal the limitations of current models and highlight the importance of reasoning in advancing visual-spatial challenges in the future.

🔍 ShortSpan Analysis of the Paper

Problem

CAPTCHAs, originally designed to distinguish humans from machines, have evolved into real-world benchmarks for evaluating the visual-spatial reasoning of vision-language models. The work shows that step-by-step reasoning is crucial for vision-language models to solve CAPTCHA tasks, which are high-difficulty spatial reasoning challenges, and that current commercial models still struggle with such reasoning, achieving only about 21.9 percent accuracy. It introduces CAPTCHA-X, the first real-world CAPTCHA benchmark with reasoning, spanning seven categories with step-by-step action solutions and grounding annotations, five reasoning-oriented metrics, and an agentic VLM framework that leverages the model's inherent reasoning. The results indicate that reasoning substantially improves solving performance and highlight a security vulnerability in automated human-verification systems.

Approach

The authors build CAPTCHA-X, a dataset of 1,839 puzzles across seven categories including grid-based and non-grid challenges, collected by programmatic interaction with websites using Selenium and PyAutoGUI while recording mouse actions and screenshots. Ground-truth coordinates are defined by manually marking valid regions, and the model must click within these acceptance regions. Reasoning steps are generated automatically by an LLM (GPT-5) conditioned on the ground-truth action trajectory, using prompts designed to elicit goal-directed, vision-language-aware and concise reasoning; every step is validated by four domain experts with a high agreement rate. A comprehensive set of metrics evaluates solving accuracy and reasoning quality, including sequence-level accuracy, Reasoning Steps, Reasoning Length, Reasoning Score, Reasoning Efficiency and Trajectory Complexity Index. The authors also propose a general agentic VLM pipeline that classifies puzzles into grid-based or non-grid branches, uses a Mapping Tool to convert puzzles into a symbolic grid for reasoning, a Reasoning Steps Generator to perform structured step-by-step inference, a Spatial Understanding Expert to ground non-grid tasks, a Discriminator to ensure coherence, an Action Generator to translate outputs into coordinates, and an Action Executor to perform the clicks. Comparing models with and without reasoning, they find that reasoning improves solving accuracy, and they validate the results with statistical tests and correlations between accuracy and spatial grounding. An agentic pipeline built on the model's reasoning achieves state-of-the-art results on five CAPTCHA types.
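As a concrete illustration of the region-based validation described above, here is a small sketch that scores predicted clicks against acceptance regions and records the L2 distance to each region's centre. Treating regions as axis-aligned rectangles is an assumption made for simplicity; the paper says valid regions were marked manually, and this code is not from the paper.

```python
import math

def score_clicks(predicted: list[tuple[float, float]],
                 regions: list[tuple[float, float, float, float]]) -> dict:
    """Score one predicted click per puzzle against its acceptance region,
    given as (x_min, y_min, x_max, y_max). Returns solving accuracy and
    the mean L2 distance from predictions to region centres."""
    hits, distances = 0, []
    for (px, py), (x0, y0, x1, y1) in zip(predicted, regions):
        if x0 <= px <= x1 and y0 <= py <= y1:
            hits += 1  # click landed inside the valid region
        centre = ((x0 + x1) / 2, (y0 + y1) / 2)
        distances.append(math.dist((px, py), centre))  # localisation error
    n = len(regions)
    return {
        "accuracy": hits / n if n else 0.0,
        "mean_l2": sum(distances) / n if n else 0.0,
    }
```

On this kind of scoring, the paper reports that reasoning both raises accuracy and shrinks the L2 error, meaning the clicks become more accurate and better localised at the same time.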

Key Findings

  • Non-reasoning performance of commercial vision-language models on CAPTCHA tasks is around 21.9 percent accuracy; adding reasoning yields a substantial improvement, with a relative gain of about 27.5 percent.
  • The CAPTCHA-X benchmark comprises 1,839 puzzles across seven categories and provides step-by-step reasoning, ground-truth coordinates and region-based validation, enabling evaluation of both solving accuracy and reasoning quality.
  • An agentic VLM framework that relies on the model's inherent reasoning achieves an average accuracy of 83.9 percent across seven CAPTCHA categories, setting a new state of the art on five categories for general agents.
  • Five reasoning-oriented metrics capture reasoning quality: Reasoning Steps, Reasoning Length, Reasoning Score, Reasoning Efficiency and Trajectory Complexity Index; Reasoning Score correlates strongly with accuracy (roughly 0.88), as does Reasoning Efficiency (roughly 0.82).
  • Two reasoning scaling laws are observed: a linear relation linking reasoning score, length and trajectory complexity with accuracy, and a power law linking accuracy to reasoning efficiency, suggesting that performance grows predictably with reasoning investment (a fitting sketch follows this list).
  • The L2 distance between predicted coordinates and ground-truth centres decreases when reasoning is used, confirming improved spatial grounding; regression shows that higher accuracy strongly predicts smaller localisation errors (R² of about 0.97).
  • Statistical tests confirm the significance of the improvements (p < 0.001 for both solving accuracy and L2 distance).
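The power law in the scaling-law item above can be illustrated with a standard log-log fit. The functional form, accuracy ≈ a · efficiency^b, follows the reported relationship, but the routine and the sample points below are illustrative, not the paper's data or coefficients.

```python
import numpy as np

def fit_power_law(efficiency: np.ndarray, accuracy: np.ndarray):
    """Fit accuracy ~= a * efficiency**b by linear regression in
    log-log space; the slope recovers the exponent b."""
    b, log_a = np.polyfit(np.log(efficiency), np.log(accuracy), 1)
    return float(np.exp(log_a)), float(b)  # (a, b)

# Made-up points for illustration only (not the paper's measurements):
eff = np.array([0.2, 0.4, 0.6, 0.8])
acc = np.array([0.30, 0.50, 0.66, 0.80])
a, b = fit_power_law(eff, acc)
print(f"accuracy ~= {a:.2f} * efficiency^{b:.2f}")
```

If the fitted exponent proves stable across model families, defenders could use it to estimate how much extra reasoning budget an attacker needs to reach a target success rate.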

Limitations

The dataset represents real-world CAPTCHA challenges with reasoning annotations; results are reported for a moving snapshot of models and may not generalise to all security settings. CAPTCHA-X is designed for research purposes, and the authors acknowledge the need for further adversarial testing and defence evaluation. The work focuses on the vulnerability of CAPTCHA-style human verification and notes that defences should move towards non-deterministic, context-rich or multi-step tasks and consider detecting or restricting model-based chain-of-thought approaches.

Why It Matters

The study demonstrates that modern vision-language models can bypass CAPTCHA challenges when they perform step-by-step reasoning, revealing a security vulnerability in automated human-verification systems. It provides defenders with CAPTCHA-X, ground-truth reasoning traces, five reasoning-oriented metrics and an agentic framework for testing verification flows against model-based attacks and assessing new defence strategies. It argues for safer CAPTCHA designs built on non-deterministic, context-rich or multi-step tasks, for detecting or restricting model-based chain-of-thought approaches, and for adversarial testing and robust verification pipelines. The work underlines the broader risks of AI-driven bypasses for fraud or abuse in online systems and calls for governance of AI-enabled attacks.

