Subtle Word Changes Break LLM Math Reasoning
Large Language Models (LLMs) are strong at many language tasks, including maths in benchmark tests. But strength on a lab test is not the same as robustness in the wild. A recent study introduces an automated method called MSCR that probes how small, natural-looking word swaps can break a model's mathematical reasoning and inflate its compute costs.
What the researchers did
The attack builds a short list of replacement words for each token in a question from three practical sources: the model's own embedding space, the WordNet dictionary, and a masked language model (the paper uses BERT large uncased). Each candidate is semantically close to the original word and readable. The method then substitutes words one at a time and checks whether the model's answer changes. If a single swap makes the model give a different result, the attack counts as successful. The team ran this against twelve open source LLMs on two standard maths benchmarks, GSM8K and MATH500, averaging three runs with temperature set to 0.6. They also tested transferability to commercial models.
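The procedure is easy to picture in code. Below is a minimal sketch of the single-word substitution loop, assuming a hypothetical `query_model` function that returns the target model's final answer for a question string; it uses only WordNet and a masked language model as candidate sources and leaves out the paper's embedding-space neighbours and readability filters for brevity.

```python
# Minimal sketch of an MSCR-style single-word substitution attack.
# Assumption: `query_model` is a hypothetical caller-supplied function that
# returns the target model's final answer for a question string. Candidate
# sources here are WordNet synonyms and a masked language model; the paper
# additionally uses embedding-space neighbours and readability checks.

from nltk.corpus import wordnet  # requires: nltk.download("wordnet")
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-large-uncased")


def wordnet_candidates(word, k=5):
    """Collect up to k WordNet synonyms that differ from the original word."""
    synonyms = []
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower() and name not in synonyms:
                synonyms.append(name)
    return synonyms[:k]


def mlm_candidates(tokens, index, threshold=0.1):
    """Mask the target word and keep MLM predictions above the probability threshold."""
    masked = tokens[:index] + [fill_mask.tokenizer.mask_token] + tokens[index + 1:]
    predictions = fill_mask(" ".join(masked))
    return [p["token_str"] for p in predictions if p["score"] > threshold]


def mscr_attack(question, query_model):
    """Try one-word substitutions until the target model's answer changes."""
    original_answer = query_model(question)
    tokens = question.split()
    for i, word in enumerate(tokens):
        candidates = set(wordnet_candidates(word)) | set(mlm_candidates(tokens, i))
        for candidate in candidates:
            perturbed = " ".join(tokens[:i] + [candidate] + tokens[i + 1:])
            if query_model(perturbed) != original_answer:
                return perturbed  # a single swap flipped the answer
    return None  # no successful perturbation found
```

In the paper, a successful swap is then double-checked by a second, commercial model to confirm that the perturbed question still asks the same thing as the original.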
The headline findings are clear and worrying. On GSM8K one model drops accuracy by as much as 49.89% after a single-word change; on MATH500 the worst drop is 35.40%. Across the open source models, almost all fall by more than 20% on GSM8K and all by more than 10% on MATH500. The perturbed problems remain semantically consistent, so the issue is not nonsense inputs but fragile internal reasoning.
Another consequence is efficiency. Perturbations push models to produce much longer answers. Average response length rises significantly: the worst case reaches roughly 2.14 times the original length, and some reasoning models produce outputs more than ten times longer. That means the attack both degrades correctness and raises computational cost.
Implications and limitations
This is not a fanciful academic exercise. The attack is practical: it uses off-the-shelf resources and creates replacements that look normal to a human reader. Real systems that grade maths, tutor students, or provide decision support could be made unreliable or more expensive by subtle input manipulation. The paper also finds transferability: commercial models show notable drops as well, roughly 15.88% for OpenAI o3 and 17.21% for GPT-4o on GSM8K.
That said, the study has bounds. It focuses on two benchmarks, a fixed set of replacement sources, and a particular masked language model. Different datasets, alternative synonym resources, or model updates might change the numbers. The experiments are averages over three runs, which smooths randomness but does not capture every source of variation.
Operationally, defenders have straightforward options. One can validate and sanitise incoming text, check model outputs with symbolic maths or deterministic solvers, and train models on adversarially perturbed examples similar to MSCR. Rate and cost controls can reduce the economic impact of longer outputs.
- Operational takeaways: add input sanitisation, verify numeric answers with symbolic tools, and include adversarial examples in training.
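As a concrete illustration of the verification step, here is a minimal sketch that cross-checks a model's numeric answer against an exact symbolic computation. The `extract_final_number` helper and the tolerance are assumptions for illustration, not anything prescribed by the paper.

```python
# Minimal sketch of deterministic answer verification with sympy.
# `extract_final_number` is a hypothetical helper that pulls the last number
# out of the model's free-text response; a real system would parse answers
# more carefully.

import re
from sympy import sympify


def extract_final_number(response_text):
    """Hypothetical helper: return the last number mentioned in the response."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response_text)
    return float(matches[-1]) if matches else None


def verify_answer(response_text, reference_expression, tolerance=1e-6):
    """Compare the model's answer against an exact symbolic reference value."""
    model_value = extract_final_number(response_text)
    if model_value is None:
        return False  # no parsable number, treat as failed verification
    reference_value = float(sympify(reference_expression))
    return abs(model_value - reference_value) <= tolerance


# Example: the model ends with "... so the total is 42." for a problem
# whose exact answer is 7 * 6.
print(verify_answer("... so the total is 42.", "7 * 6"))  # True
```

A check like this does not make the model more robust, but it turns a silent wrong answer into a flagged one, which is often enough for grading or decision-support pipelines.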
The bottom line is simple: small, human-plausible edits can both fool and bloat LLM reasoning. For anyone deploying maths-capable models, robustness checks against this class of perturbation are not optional; they are practical risk management.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
MSCR: Exploring the Vulnerability of LLMs' Mathematical Reasoning Abilities Using Multi-Source Candidate Replacement
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies the vulnerability of large language models (LLMs) to minor input perturbations in mathematical reasoning. It presents MSCR, an automated adversarial attack that assesses robustness and efficiency by perturbing input questions through multi source candidate replacement, highlighting potential reliability and cost concerns in math oriented AI applications.
Approach
MSCR generates for each word in a question a set of semantically similar candidates from three sources: cosine similarity in the model's embedding space, the WordNet dictionary, and contextual predictions from a masked language model. The cosine based candidates are normalised and filtered to the top ten by similarity, with additional character level and morphological checks to preserve readability. WordNet supplies synonym based candidates, from which the top five most similar are kept. The masked language model supplies high confidence replacements by masking the target word and selecting candidates with probability greater than 0.1. The union of these candidates forms the final set. Word level replacements may be global for WordNet and MLM candidates to preserve grammar, while cosine based replacements are local.
After each replacement a new adversarial input is solved by the target model. An attack is preliminarily successful if the perturbed input yields a different answer from the original; otherwise the search continues with other candidates or words. A secondary evaluation uses a commercial LLM to confirm the attack, improving the robustness of the assessment.
Experiments span twelve open source LLMs across the GSM8K and MATH500 benchmarks, with temperature fixed at 0.6 and each setup run three times and averaged. The MLM used is BERT large uncased, and transferability is tested on OpenAI o3 and GPT-4o, with Qwen3 Max Preview as the secondary evaluation model.
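To make the first candidate source concrete, here is a minimal sketch of embedding-space nearest neighbours filtered to the top ten by cosine similarity. The encoder used here (bert-base-uncased) is a stand-in for illustration only, since the paper draws these candidates from the target model's own embedding space, and the character-level and morphological checks are omitted.

```python
# Minimal sketch of the cosine-similarity candidate source: nearest
# neighbours in an input embedding matrix, normalised and filtered to the
# top ten. The model choice and the missing readability filters are
# simplifying assumptions.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # stand-in; the attack uses the target LLM's embeddings
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Normalise every row of the input embedding matrix once, so that a dot
# product between rows equals cosine similarity.
embeddings = model.get_input_embeddings().weight.detach()
normed = torch.nn.functional.normalize(embeddings, dim=-1)


def cosine_candidates(word, top_k=10):
    """Return the top_k vocabulary tokens closest to `word` by cosine similarity."""
    token_id = tokenizer.convert_tokens_to_ids(word)
    if token_id == tokenizer.unk_token_id:
        return []  # word is not a single vocabulary token
    similarities = normed @ normed[token_id]
    similarities[token_id] = -1.0  # exclude the word itself
    best = torch.topk(similarities, top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(best)


print(cosine_candidates("apples"))
```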
Key Findings
- Even a single word perturbation can substantially reduce accuracy across models; on GSM8K the maximum drop is 49.89% (Qwen2.5-Math-1.5B-Instruct) and on MATH500 the maximum drop is 35.40% (gemma-2-9b-it).
- Across GSM8K, almost all open source models suffer an accuracy drop greater than 20%, with around half exceeding 30% deterioration; on MATH500 all open source models drop by more than 10%, and about half by more than 20%.
- Transferability exists to commercial models; on GSM8K OpenAI o3 drops 15.88% and GPT-4o 17.21%, while on MATH500 the drops are 6.80% and 7.80% respectively, indicating cross model vulnerability to the MSCR perturbations.
- Perturbations bias the model to generate longer answers; average response length increases significantly, reaching up to 2.14 times the original length, with some reasoning models producing outputs more than ten times longer, implying longer, more resource intensive reasoning paths.
- Larger models within a family tend to show relatively higher robustness, yet still incur substantial declines (greater than 20% on GSM8K and greater than 10% on MATH500), underscoring persistent robustness challenges in mathematical reasoning tasks.
- The results collectively reveal robustness deficiencies and efficiency bottlenecks in current LLMs when dealing with mathematical reasoning under adversarial input.
Limitations
The evaluation used two benchmarks, GSM8K and MATH500, across twelve open source LLMs with a specific set of perturbation sources and an MLM, which may limit generalisability to other tasks, models, or perturbation methods. The reliance on WordNet, a single MLM, and a particular embedding space means results could vary with alternative resources. A secondary verification step with commercial models, while increasing robustness of the assessment, may not capture all real world scenarios or model updates. Results are reported as averages over three trials per configuration, which may not fully capture all sources of variability.
Why It Matters
The findings demonstrate concrete vulnerabilities in LLM based mathematical reasoning to subtle word level changes, with implications for tutoring, auto grading, decision support and other math enabled services. Attacks can degrade accuracy and trigger longer, costlier responses, reducing reliability and efficiency. Mitigations suggested include adversarial training with MSCR like perturbations, input validation and sanitisation, result verification with symbolic tools, and defences against prompt based manipulation or abnormal output growth.
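As a final illustration, here is a minimal sketch of the adversarial-training mitigation: augmenting a fine-tuning set with MSCR-like perturbed questions while keeping the original gold answers, on the assumption that the perturbations preserve meaning. The `generate_perturbations` function is hypothetical; in practice it would be a candidate-replacement routine like the one sketched earlier.

```python
# Minimal sketch of adversarial data augmentation for fine-tuning.
# `generate_perturbations` is a hypothetical function returning semantically
# consistent single-word variants of a question (for example, built from the
# candidate-replacement code sketched earlier in this article).

def augment_training_set(examples, generate_perturbations, variants_per_question=2):
    """Pair each perturbed question with the original gold answer."""
    augmented = []
    for question, answer in examples:
        augmented.append((question, answer))
        for perturbed in generate_perturbations(question)[:variants_per_question]:
            # The gold answer is unchanged because the perturbation is meaning-preserving.
            augmented.append((perturbed, answer))
    return augmented
```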