
Researchers Expose Easy LLM Hacking That Flips Results

Attacks
Published: Thu, Sep 11, 2025 • By Clara Nyx
New research shows large language models used for text annotation can flip scientific conclusions simply by changing models, prompts, or settings. The team replicates 37 annotation tasks across 18 models and finds state-of-the-art systems produce wrong conclusions in about one in three hypotheses. The paper warns deliberate manipulation is trivial.

A new paper delivers a rude wake-up call: treating large language models as cheap, reliable annotators is now a security and integrity risk. The researchers replicated 37 annotation tasks with 18 models, analysing 13 million labels, and found that modern LLMs yield incorrect conclusions for roughly one in three hypotheses. Smaller models perform worse, and near-significance results are especially fragile.

Why this matters: policy decisions, academic findings, and media claims increasingly rely on automated labels. When a handful of model choices or prompt paraphrases can manufacture false positives for 94 percent of null hypotheses, cheap automation stops being a productivity tool and starts being a vulnerability. Imagine a policy study on hate speech or misinformation flipping simply because the analyst tried a different prompt. That is not hypothetical anymore.

The worst part is intentionality. The authors show deliberate LLM hacking is remarkably simple: with a few prompts and a few models you can hide real effects or create fake ones. Common statistical band-aids do not solve the problem. Even heavy-handed corrections trade off false positives for false negatives and rarely eliminate the underlying fragility.

Short verdict: LLMs are instruments, not oracles. Use them like you would a lab machine that needs calibration and logging. Don’t trust results that sit on the edge of significance without human verification.

Actions to take now:

  • Always pre-register and publish full model configurations and decoding settings.
  • Add targeted human annotations for borderline findings and run cross-model sensitivity checks before publishing claims; a minimal sketch follows.
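
To make the sensitivity-check recommendation concrete, here is a minimal sketch in Python, assuming the labels from each (model, prompt) configuration are stored as separate columns of one table and feed the same downstream regression; the file name, column naming scheme, outcome and covariate are illustrative, not taken from the paper.

```python
# Minimal sketch of a cross-model / cross-prompt sensitivity check run before a
# claim is published. Assumes each (model, prompt) configuration's labels sit in
# a separate column; file, column and variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("annotations.csv")  # hypothetical: one row per annotated document
label_columns = [c for c in df.columns if c.startswith("label_")]  # e.g. label_gpt4_prompt1

rows = []
for col in label_columns:
    # Re-fit the same downstream regression with each configuration's labels.
    fit = smf.ols(f"outcome ~ {col} + covariate", data=df).fit()
    rows.append({"config": col, "coef": fit.params[col], "pvalue": fit.pvalues[col]})

summary = pd.DataFrame(rows)
# A robust finding should reach (or miss) significance under every reasonable configuration.
conclusion_flips = summary["pvalue"].lt(0.05).nunique() > 1
print(summary.sort_values("pvalue"))
print("Conclusion flips across configurations:", conclusion_flips)
```

If the significance flag changes across configurations, treat the finding as unresolved and fall back on targeted human annotation, in line with the recommendations above, rather than picking a favourite configuration.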

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

Authors: Joachim Baumann, Paul Röttger, Aleksandra Urman, Albert Wendsjö, Flor Miriam Plaza-del-Arco, Johannes B. Gruber, and Dirk Hovy
Large language models (LLMs) are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection, prompting strategy, or temperature settings). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors. We call this LLM hacking. We quantify the risk of LLM hacking by replicating 37 data annotation tasks from 21 published social science research studies with 18 different models. Analyzing 13 million LLM labels, we test 2,361 realistic hypotheses to measure how plausible researcher choices affect statistical conclusions. We find incorrect conclusions based on LLM-annotated data in approximately one in three hypotheses for state-of-the-art models, and in half the hypotheses for small language models. While our findings show that higher task performance and better general model capabilities reduce LLM hacking risk, even highly accurate models do not completely eliminate it. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of findings near significance thresholds. Our extensive analysis of LLM hacking mitigation techniques emphasizes the importance of human annotations in reducing false positive findings and improving model selection. Surprisingly, common regression estimator correction techniques are largely ineffective in reducing LLM hacking risk, as they heavily trade off Type I vs. Type II errors. Beyond accidental errors, we find that intentional LLM hacking is unacceptably simple. With few LLMs and just a handful of prompt paraphrases, anything can be presented as statistically significant.

🔍 ShortSpan Analysis of the Paper

Problem

Large language models are increasingly used to automate text annotation in social science research, but outputs vary dramatically with implementation choices such as model selection, prompting strategy and decoding settings. This variation can introduce systematic biases and random errors that propagate to downstream analyses, causing Type I (false positive), Type II (false negative), Type S (wrong sign) and Type M (exaggerated magnitude) errors. The authors label this phenomenon LLM hacking and seek to quantify its risk, evaluate mitigation strategies and highlight the security implications for research integrity and policy conclusions.

Approach

The study combines a systematic literature review with a large-scale empirical replication. The authors replicate 37 data annotation tasks drawn from 21 social science studies using 18 language models, analysing 13 million labels and testing 2,361 realistic hypotheses about how researcher choices influence statistical conclusions. They examine tasks such as stance detection and ideology classification, and assess how annotations feed into regression analyses. The work is grounded in a literature review of 103 papers and includes an extensive evaluation of mitigation techniques. The analysis considers the effect of model performance, task characteristics and prompt choices on LLM hacking risk, and compares human-based with machine-based annotation strategies.
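
As a rough illustration of why this propagation matters (this is not the authors' code or data), the simulation below fits the same logistic regression once on ground-truth labels and once on LLM labels whose errors are correlated with the predictor; every name, sample size and error rate is invented for the example.

```python
# Illustrative simulation: a group-correlated annotation bias masks a real effect
# (a Type II error) even though most LLM labels are correct. All values invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 4000

# Ground truth: documents from group 1 are somewhat more likely to carry the label.
group = rng.integers(0, 2, n)
true_label = (rng.random(n) < 0.40 + 0.08 * group).astype(int)

# LLM labels: the annotator over-predicts the label for group-0 documents,
# i.e. a systematic, group-correlated bias rather than pure random noise.
false_positive = (group == 0) & (rng.random(n) < 0.12)
llm_label = np.where(false_positive, 1, true_label)

df = pd.DataFrame({"group": group, "true_label": true_label, "llm_label": llm_label})

for col in ("true_label", "llm_label"):
    fit = smf.logit(f"{col} ~ group", data=df).fit(disp=0)
    print(f"{col}: coef={fit.params['group']:.3f}, p={fit.pvalues['group']:.4f}")
```

With the ground-truth labels the group effect is strongly significant; with the biased LLM labels the estimated effect shrinks towards zero and typically no longer reaches significance, the Type II pattern the replication quantifies at scale.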

Key Findings

  • On average, state-of-the-art models have about a one-in-three chance of LLM hacking, with risk ranging from roughly 31 to 50 per cent across tasks and configurations.
  • Risk is present across model sizes, from small 1B-parameter models with around 50 per cent error rates to large models with about 31 per cent risk for some configurations; some tasks exhibit very low risk (around 5 per cent for humor detection) while others exceed 65 per cent for ideology and frame classification.
  • Type II errors dominate, with missed true effects occurring in 31 to 59 per cent of cases depending on model size; even when effects are detected, estimated effect sizes can be substantially biased, with deviations of 40 to 77 per cent from true values.
  • Intentional LLM hacking is feasible and strikingly easy. With a small set of models and a few prompt paraphrases, attackers can manufacture false positives for about 94 per cent of null hypotheses and hide true effects in about 98 per cent of cases. Significance can be reversed in about 68 per cent of cases with true differences, even among top-performing models (a sketch of this selective-reporting search follows the list).
  • Proximity to the significance threshold is the strongest predictor of hacking risk; task characteristics explain around 21 per cent of the variance, while model performance accounts for about 8 per cent; prompt engineering contributes less than 1 per cent.
  • There is no observed correlation between human inter-annotator agreement and LLM hacking risk, meaning even high human consensus does not guarantee reliable LLM-based conclusions.
  • Using 100 human annotations beats 100,000 LLM annotations for protecting against false positives, achieving about 10 per cent error with modest human data; hybrid approaches with very large LLM-based annotation sets have higher false-negative rates.
  • Regression estimator correction techniques largely fail to reduce LLM hacking risk because they trade off Type I against Type II errors; even optimal strategies reduce risk only moderately. For instance, 1,000 human annotations can lower overall risk to about 20 per cent, but not eliminate it.
  • Overall the authors urge a fundamental shift in practice: LLMs should be treated as instruments requiring careful calibration and validation rather than as simple black-box annotators. They provide practical recommendations to limit both accidental and deliberate LLM hacking and advocate transparency through pre-registration and full reporting of configuration choices.
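
To make the intentional-hacking finding above concrete, here is a minimal sketch of the selective-reporting search it describes, under the assumption that each (model, prompt) configuration's labels are stored as a separate column; the file layout, column names and regression are hypothetical, and the snippet illustrates the attack surface rather than endorsing it.

```python
# Minimal sketch of intentional LLM hacking as a selective-reporting search:
# try many (model, prompt) configurations and report only the first one whose
# downstream regression crosses p < 0.05. File layout and names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("annotations.csv")  # hypothetical: one row per annotated document
configs = [c for c in df.columns if c.startswith("label_")]  # e.g. label_llama3_prompt2

cherry_picked = None
for col in configs:
    fit = smf.ols(f"outcome ~ {col}", data=df).fit()
    if fit.pvalues[col] < 0.05:  # stop at the first "significant" configuration
        cherry_picked = (col, round(fit.params[col], 3), round(fit.pvalues[col], 4))
        break

print("Reported configuration:", cherry_picked)

# The defence is the mirror image of the attack: disclose every configuration that
# was run, pre-register the choice, and validate against a human-annotated subset.
```

The point of the sketch is how little machinery the attack needs; with a handful of models and prompt paraphrases, the paper finds a "significant" configuration exists for about 94 per cent of null hypotheses.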

Limitations

The study focuses on 37 annotation tasks across 18 models and 21 datasets within computational social science, and its findings are constrained to these contexts and configurations. While the evaluation is extensive, the results may not fully generalise to all possible annotation tasks or all model types. The analysis relies on ground-truth data and specific experimental designs, so extrapolation beyond the tested settings should be undertaken with caution.

Why It Matters

The work highlights a previously under-recognised vulnerability in AI-assisted research: seemingly minor changes in model configuration can flip scientific conclusions, potentially biasing policy-relevant findings and governance decisions. The authors emphasise that exploitation of this vulnerability could enable manipulation of results at low cost, underscoring the need for robust verification, cross-model checks, sensitivity analyses and explicit documentation of configuration choices. They argue for a shift from viewing LLMs as convenient annotators to treating them as complex instruments that require validation, with practical mitigations including human annotations, transparency standards and pre-registration to guard against both accidental and deliberate manipulation.

