Subtle Word Changes Break LLM Math Reasoning
Large Language Models (LLMs) are strong at many language tasks, including maths in benchmark tests. But strength on a lab test is not the same as robustness in the wild. A recent study introduces an automated method called MSCR that probes how small, natural-looking word swaps can break a model's mathematical reasoning and inflate its compute costs.
What the researchers did
The attack builds a short list of replacement words for each token in a question from three practical sources: the model's own embedding space, the WordNet dictionary, and a masked language model (the paper uses BERT large uncased). Each candidate is semantically close to the original word and readable. The method then substitutes words one at a time and checks whether the model's answer changes. If a single swap makes the model give a different result, the attack counts as successful. The team ran this against twelve open source LLMs on two standard maths benchmarks, GSM8K and MATH500, averaging three runs with temperature set to 0.6. They also tested transferability to commercial models.
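The procedure is easy to picture in code. Below is a minimal sketch of the single-word substitution loop, assuming a hypothetical `query_model` function that returns the target model's final answer for a question string; it uses only WordNet and a masked language model as candidate sources and leaves out the paper's embedding-space neighbours and readability filters for brevity.

```python
# Minimal sketch of an MSCR-style single-word substitution attack.
# Assumption: `query_model` is a hypothetical caller-supplied function that
# returns the target model's final answer for a question string. Candidate
# sources here are WordNet synonyms and a masked language model; the paper
# additionally uses embedding-space neighbours and readability checks.

from nltk.corpus import wordnet  # requires: nltk.download("wordnet")
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-large-uncased")


def wordnet_candidates(word, k=5):
    """Collect up to k WordNet synonyms that differ from the original word."""
    synonyms = []
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower() and name not in synonyms:
                synonyms.append(name)
    return synonyms[:k]


def mlm_candidates(tokens, index, threshold=0.1):
    """Mask the target word and keep MLM predictions above the probability threshold."""
    masked = tokens[:index] + [fill_mask.tokenizer.mask_token] + tokens[index + 1:]
    predictions = fill_mask(" ".join(masked))
    return [p["token_str"] for p in predictions if p["score"] > threshold]


def mscr_attack(question, query_model):
    """Try one-word substitutions until the target model's answer changes."""
    original_answer = query_model(question)
    tokens = question.split()
    for i, word in enumerate(tokens):
        candidates = set(wordnet_candidates(word)) | set(mlm_candidates(tokens, i))
        for candidate in candidates:
            perturbed = " ".join(tokens[:i] + [candidate] + tokens[i + 1:])
            if query_model(perturbed) != original_answer:
                return perturbed  # a single swap flipped the answer
    return None  # no successful perturbation found
```

In the paper, a successful swap is then double-checked by a second, commercial model to confirm that the perturbed question still asks the same thing as the original.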
The headline findings are clear and worrying. On GSM8K one model drops accuracy by as much as 49.89% after a single-word change; on MATH500 the worst drop is 35.40%. Across the open source models, almost all fall by more than 20% on GSM8K and all by more than 10% on MATH500. The perturbed problems remain semantically consistent, so the issue is not nonsense inputs but fragile internal reasoning.
Another consequence is efficiency. Perturbations push models to produce much longer answers. Average response length rises significantly: the worst case reaches roughly 2.14 times the original length, and some reasoning models produce outputs more than ten times longer. That means the attack both degrades correctness and raises computational cost.
Implications and limitations
This is not a fanciful academic exercise. The attack is practical: it uses off-the-shelf resources and creates replacements that look normal to a human reader. Real systems that grade maths, tutor students, or provide decision support could be made unreliable or more expensive by subtle input manipulation. The paper also finds transferability: commercial models show notable drops as well, roughly 15.88% for OpenAI o3 and 17.21% for GPT-4o on GSM8K.
That said, the study has bounds. It focuses on two benchmarks, a fixed set of replacement sources, and a particular masked language model. Different datasets, alternative synonym resources, or model updates might change the numbers. The experiments are averages over three runs, which smooths randomness but does not capture every source of variation.
Operationally, defenders have straightforward options. One can validate and sanitise incoming text, check model outputs with symbolic maths or deterministic solvers, and train models on adversarially perturbed examples similar to MSCR. Rate and cost controls can reduce the economic impact of longer outputs.
- Operational takeaways: add input sanitisation, verify numeric answers with symbolic tools, and include adversarial examples in training.
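As a concrete illustration of the verification step, here is a minimal sketch that cross-checks a model's numeric answer against an exact symbolic computation. The `extract_final_number` helper and the tolerance are assumptions for illustration, not anything prescribed by the paper.

```python
# Minimal sketch of deterministic answer verification with sympy.
# `extract_final_number` is a hypothetical helper that pulls the last number
# out of the model's free-text response; a real system would parse answers
# more carefully.

import re
from sympy import sympify


def extract_final_number(response_text):
    """Hypothetical helper: return the last number mentioned in the response."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response_text)
    return float(matches[-1]) if matches else None


def verify_answer(response_text, reference_expression, tolerance=1e-6):
    """Compare the model's answer against an exact symbolic reference value."""
    model_value = extract_final_number(response_text)
    if model_value is None:
        return False  # no parsable number, treat as failed verification
    reference_value = float(sympify(reference_expression))
    return abs(model_value - reference_value) <= tolerance


# Example: the model ends with "... so the total is 42." for a problem
# whose exact answer is 7 * 6.
print(verify_answer("... so the total is 42.", "7 * 6"))  # True
```

A check like this does not make the model more robust, but it turns a silent wrong answer into a flagged one, which is often enough for grading or decision-support pipelines.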
The bottom line is simple: small, human-plausible edits can both fool and bloat LLM reasoning. For anyone deploying maths-capable models, robustness checks against this class of perturbation are not optional; they are practical risk management.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
MSCR: Exploring the Vulnerability of LLMs' Mathematical Reasoning Abilities Using Multi-Source Candidate Replacement
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies the vulnerability of large language models (LLMs) to minor input perturbations in mathematical reasoning. It presents MSCR, an automated adversarial attack that assesses robustness and efficiency by perturbing input questions through multi source candidate replacement, highlighting potential reliability and cost concerns in math oriented AI applications.
Approach
MSCR generates for each word in a question a set of semantically similar candidates from three sources: cosine similarity in the model's embedding space, the WordNet dictionary, and contextual predictions from a masked language model. The cosine based candidates are normalised and filtered to the top ten by similarity, with additional character level and morphological checks to preserve readability. WordNet supplies synonym based candidates, from which the top five most similar are kept. The masked language model supplies high confidence replacements by masking the target word and selecting candidates with probability greater than 0.1. The union of these candidates forms the final set. Word level replacements may be global for WordNet and MLM candidates to preserve grammar, while cosine based replacements are local.
After each replacement a new adversarial input is solved by the target model. An attack is preliminarily successful if the perturbed input yields a different answer from the original; otherwise the search continues with other candidates or words. A secondary evaluation uses a commercial LLM to confirm the attack, improving the robustness of the assessment.
Experiments span twelve open source LLMs across the GSM8K and MATH500 benchmarks, with temperature fixed at 0.6 and each setup run three times and averaged. The MLM used is BERT large uncased, and transferability is tested on OpenAI o3 and GPT-4o, with Qwen3 Max Preview as the secondary evaluation model.
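To make the first candidate source concrete, here is a minimal sketch of embedding-space nearest neighbours filtered to the top ten by cosine similarity. The encoder used here (bert-base-uncased) is a stand-in for illustration only, since the paper draws these candidates from the target model's own embedding space, and the character-level and morphological checks are omitted.

```python
# Minimal sketch of the cosine-similarity candidate source: nearest
# neighbours in an input embedding matrix, normalised and filtered to the
# top ten. The model choice and the missing readability filters are
# simplifying assumptions.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # stand-in; the attack uses the target LLM's embeddings
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Normalise every row of the input embedding matrix once, so that a dot
# product between rows equals cosine similarity.
embeddings = model.get_input_embeddings().weight.detach()
normed = torch.nn.functional.normalize(embeddings, dim=-1)


def cosine_candidates(word, top_k=10):
    """Return the top_k vocabulary tokens closest to `word` by cosine similarity."""
    token_id = tokenizer.convert_tokens_to_ids(word)
    if token_id == tokenizer.unk_token_id:
        return []  # word is not a single vocabulary token
    similarities = normed @ normed[token_id]
    similarities[token_id] = -1.0  # exclude the word itself
    best = torch.topk(similarities, top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(best)


print(cosine_candidates("apples"))
```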
Key Findings
- Even a single word perturbation can substantially reduce accuracy across models; on GSM8K the maximum drop is 49.89% (Qwen2.5-Math-1.5B-Instruct) and on MATH500 the maximum drop is 35.40% (gemma-2-9b-it).
- Across GSM8K, almost all open source models suffer an accuracy drop greater than 20%, with around half exceeding 30% deterioration; on MATH500 all open source models drop by more than 10%, and about half by more than 20%.
- Transferability exists to commercial models; on GSM8K OpenAI o3 drops 15.88% and GPT-4o 17.21%, while on MATH500 the drops are 6.80% and 7.80% respectively, indicating cross model vulnerability to the MSCR perturbations.
- Perturbations bias the model to generate longer answers; average response length increases significantly, reaching up to 2.14 times the original length, with some reasoning models producing outputs more than ten times longer, implying longer, more resource intensive reasoning paths.
- Larger models within a family tend to show relatively higher robustness, yet still incur substantial declines (greater than 20% on GSM8K and greater than 10% on MATH500), underscoring persistent robustness challenges in mathematical reasoning tasks.
- The results collectively reveal robustness deficiencies and efficiency bottlenecks in current LLMs when dealing with mathematical reasoning under adversarial input.
Limitations
The evaluation used two benchmarks, GSM8K and MATH500, across twelve open source LLMs with a specific set of perturbation sources and an MLM, which may limit generalisability to other tasks, models, or perturbation methods. The reliance on WordNet, a single MLM, and a particular embedding space means results could vary with alternative resources. A secondary verification step with commercial models, while increasing robustness of the assessment, may not capture all real world scenarios or model updates. Results are reported as averages over three trials per configuration, which may not fully capture all sources of variability.
Why It Matters
The findings demonstrate concrete vulnerabilities in LLM based mathematical reasoning to subtle word level changes, with implications for tutoring, auto grading, decision support and other math enabled services. Attacks can degrade accuracy and trigger longer, costlier responses, reducing reliability and efficiency. Mitigations suggested include adversarial training with MSCR like perturbations, input validation and sanitisation, result verification with symbolic tools, and defences against prompt based manipulation or abnormal output growth.
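As a final illustration, here is a minimal sketch of the adversarial-training mitigation: augmenting a fine-tuning set with MSCR-like perturbed questions while keeping the original gold answers, on the assumption that the perturbations preserve meaning. The `generate_perturbations` function is hypothetical; in practice it would be a candidate-replacement routine like the one sketched earlier.

```python
# Minimal sketch of adversarial data augmentation for fine-tuning.
# `generate_perturbations` is a hypothetical function returning semantically
# consistent single-word variants of a question (for example, built from the
# candidate-replacement code sketched earlier in this article).

def augment_training_set(examples, generate_perturbations, variants_per_question=2):
    """Pair each perturbed question with the original gold answer."""
    augmented = []
    for question, answer in examples:
        augmented.append((question, answer))
        for perturbed in generate_perturbations(question)[:variants_per_question]:
            # The gold answer is unchanged because the perturbation is meaning-preserving.
            augmented.append((perturbed, answer))
    return augmented
```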