Adversarial tweaks mislead binary code similarity detectors
Attacks
Security teams lean on binary code similarity detection to cluster malware, spot code reuse, and support reverse engineering when source code is unavailable. As these tools adopt machine learning, they promise speed and scale. They also inherit a familiar problem in adversarial machine learning: small, carefully chosen changes can flip a model’s verdict without changing what the program actually does.
A new study probes that gap. The authors evaluate six machine learning models for binary code similarity detection (BCSD) against semantics-preserving transformations using a framework they call asmFooler. They transform 620 baseline samples into 9,565 binary variants with eight diversification techniques, including in-place code randomisation, inter–basic block reordering, semantic NOPs, junk code insertion, and compiler-driven obfuscation via Obfuscator-LLVM. The threat model is black box: the attacker only mutates the raw binary and observes the model’s outcome. To reflect real model limits, a transformation budget caps added instructions or bytes based on each model’s input constraints.
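To make the idea of a semantics-preserving edit concrete, here is a minimal sketch of one such transformation: padding a function's instruction list with "semantic NOPs" (filler that leaves program behaviour unchanged) while respecting a byte budget. The mnemonics, byte sizes, and budget value are illustrative assumptions, not the paper's exact transformation set.

```python
# Minimal sketch of a semantics-preserving edit: pad a function with
# semantic NOPs (instruction snippets that cancel out) under a byte budget.
# The mnemonics, sizes and budget below are illustrative assumptions.
import random

SEMANTIC_NOPS = [
    (["push rax", "pop rax"], 2),   # save then restore a register
    (["xchg rbx, rbx"], 3),         # exchange a register with itself
    (["lea rcx, [rcx + 0]"], 4),    # identity address computation
]

def insert_semantic_nops(instructions, byte_budget=100, seed=0):
    """Return a new instruction list with NOP-like fillers inserted at
    random positions, never exceeding byte_budget added bytes."""
    rng = random.Random(seed)
    out = list(instructions)
    spent = 0
    while True:
        candidates = [c for c in SEMANTIC_NOPS if spent + c[1] <= byte_budget]
        if not candidates:
            break                               # budget exhausted
        filler, size = rng.choice(candidates)
        pos = rng.randrange(len(out) + 1)       # any gap between instructions
        out[pos:pos] = filler
        spent += size
    return out, spent

func = ["mov rax, rdi", "add rax, rsi", "ret"]
variant, added = insert_semantic_nops(func, byte_budget=12)
print(added, variant)
```

The budget check mirrors the paper's observation that each model's input limits cap how much perturbation an attacker can usefully add.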
The headline result is straightforward and uncomfortable. With modest edits, attackers can induce both false negatives and false positives. The details matter. Robustness is not a single metric but a property of the whole pipeline: preprocessing and normalisation, feature choice, and architecture. Models that use control-flow graph (CFG) structure, such as Gemini and Genius, or dynamic signals like micro-traces in Trex, withstand some perturbations better than instruction-level models. Approaches that rely mainly on instruction sequences, such as BinShot, are more exposed to basic block reordering. Yet LLVM-based obfuscation dents performance broadly; interestingly, BinShot shows relative robustness there. Different models lean on different cues, and each cue can be targeted.
Attack potency is notable. The study reports that well-crafted edits can drive false positives to high rates with small changes, on the order of tens of instructions and budgets up to 100 bytes, and that false negatives often require even less. To understand why, the authors use explainability tools. Under attack, attention and feature importance disperse: SHAP values collapse towards zero and saliency on key tokens drops. The model becomes less anchored to the truly informative parts of the function, increasing the chance of a wrong match.
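One way to quantify that dispersion, sketched below under the assumption that per-token SHAP values for the clean and attacked inputs are already available as arrays, is to compare how much attribution mass sits near zero and how much is carried by the top tokens. The threshold, the top-k choice, and the numbers are illustrative.

```python
# Sketch: measure how attributions "collapse towards zero" under attack.
# Assumes per-token SHAP values have already been computed elsewhere
# (e.g. with an explainability library); the values here are made up.
import numpy as np

def near_zero_fraction(shap_values, eps=0.01):
    """Fraction of tokens whose absolute attribution falls below eps."""
    v = np.abs(np.asarray(shap_values, dtype=float))
    return float((v < eps).mean())

def top_k_mass(shap_values, k=5):
    """Share of total attribution carried by the k most important tokens."""
    v = np.abs(np.asarray(shap_values, dtype=float))
    top = np.sort(v)[::-1][:k]
    return float(top.sum() / (v.sum() + 1e-12))

clean    = [0.40, 0.25, 0.10, 0.05, 0.02, 0.01, 0.01]
attacked = [0.06, 0.05, 0.05, 0.04, 0.004, 0.003, 0.002]

print("near-zero fraction:", near_zero_fraction(clean), near_zero_fraction(attacked))
print("top-5 mass:        ", top_k_mass(clean), top_k_mass(attacked))
```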
Transferability is mixed. Perturbations that fool one model sometimes carry over to others, especially when input representations are similar. Graph-based models are generally harder to fool this way, though there are notable exceptions; Trex shows transfer in some pairings. This variability is useful operationally: it suggests ensembles drawn from diverse representations can buy time and accuracy, though they are not a silver bullet.
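A simple way to check transferability in practice, sketched here with hypothetical model names and a stand-in oracle, is to build a transfer matrix: craft variants against one model and measure how often they also flip each of the others.

```python
# Sketch: cross-model transfer matrix for adversarial variants.
# `fools(model, variant)` stands in for querying a real BCSD model;
# the model names and variant data are hypothetical.
from collections import defaultdict

def transfer_matrix(variants_by_source, models, fools):
    """variants_by_source maps a source model to variants crafted against it;
    returns the rate at which those variants also fool each target model."""
    rates = defaultdict(dict)
    for src, variants in variants_by_source.items():
        for tgt in models:
            hits = sum(1 for v in variants if fools(tgt, v))
            rates[src][tgt] = hits / max(len(variants), 1)
    return dict(rates)

models = ["Gemini", "SAFE", "BinShot"]
variants_by_source = {m: [f"{m}_variant_{i}" for i in range(10)] for m in models}
fake_fools = lambda model, variant: hash((model, variant)) % 3 == 0  # stand-in oracle
print(transfer_matrix(variants_by_source, models, fake_fools))
```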
For defenders, two implications stand out. First, ML-as-a-service pipelines that gate decisions on a single BCSD score are at risk. Small edits can push borderline samples over a threshold, enabling evasion or misattribution. That matters for malware triage, code provenance claims, and cross-binary vulnerability matching in firmware or supply chains. Second, robustness is designable. The study points towards practical defences: adversarial training or data augmentation with semantics-preserving transforms; combining static and dynamic features; and integrating multiple analysis methods so that an attacker must defeat more than one representation at once. Because the effectiveness of attacks is bounded by model-specific input limits, exposing and stress-testing those limits is part of engineering, not an afterthought.
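The ensemble point can be made concrete with a short sketch: instead of gating a decision on one similarity score, require agreement from models built on different representations. The scores, thresholds, and two-of-three voting rule below are assumptions for illustration, not a recommendation from the paper.

```python
# Sketch: gate a match decision on agreement across diverse BCSD models
# rather than a single score crossing a threshold. Thresholds and the
# two-of-three rule are illustrative choices.
def single_score_gate(score, threshold=0.8):
    return score >= threshold

def ensemble_gate(scores, thresholds, min_agree=2):
    """scores/thresholds are dicts keyed by model name; declare a match
    only if at least min_agree models independently agree."""
    votes = sum(1 for m, s in scores.items() if s >= thresholds[m])
    return votes >= min_agree

scores = {"graph_model": 0.83, "sequence_model": 0.62, "trace_model": 0.79}
thresholds = {"graph_model": 0.80, "sequence_model": 0.75, "trace_model": 0.70}

print(single_score_gate(scores["sequence_model"]))   # one borderline model decides
print(ensemble_gate(scores, thresholds))             # requires broader agreement
```

The design intent is simply that an attacker must now defeat several representations at once, which the study suggests is harder when those representations differ.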
There is a policy angle here too. If public bodies or critical suppliers procure automated analysis for attribution or compliance, they should require evidence of performance under semantics-preserving transformations, not just clean benchmarks. Independent evaluations using datasets like asmFooler can inform certification and auditing. Vendors should document preprocessing, input constraints, and model assumptions, so operators understand how decisions can fail. And organisations should avoid treating BCSD outputs as dispositive in enforcement or legal contexts without corroboration.
The work has limits, as the authors note: no formal verification of every transformation, a focus on 64-bit ELF, and partial coverage of obfuscation and protections. Still, it offers a concrete path to measure and improve robustness. If we standardise such testing and set expectations in procurement and oversight, we can nudge this class of tools toward safer defaults without stalling innovation.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Fool Me If You Can: On the Robustness of Binary Code Similarity Detection Models against Semantics-preserving Transformations
🔍 ShortSpan Analysis of the Paper
Problem
The paper investigates the robustness of deep-learning-based binary code similarity detection (BCSD) models to semantics-preserving transformations. It highlights the risk that automated binary analysis used for malware detection, attribution, and reverse engineering can be misled by small changes that do not alter program semantics, raising security concerns for ML-as-a-service and other automated analysis pipelines.
Approach
The authors introduce asmFooler, a system that generates adversarial variants of binaries by applying eight semantics-preserving transformations across six BCSD models. They assemble a corpus of 9,565 transformed binaries from 620 baseline samples and evaluate six models: Genius, Gemini, Asm2Vec, SAFE, Trex, and BinShot. Transformations cover code diversification, including in-place code randomisation, inter-basic-block reordering, and semantic NOP implantation, as well as junk code insertion and compiler-driven obfuscation via Obfuscator-LLVM. A transformation budget limits added instructions to reflect model input constraints such as maximum tokens or instruction counts. The threat model assumes a black-box BCSD service where an attacker can only mutate the raw binary to try to flip the model outcome. The evaluation examines false negatives (FN), false positives (FP), and the transferability of FP triggers across models. The FP-triggering perturbations employ a greedy prefix-based sampling strategy to craft code that makes dissimilar functions appear similar. The authors also analyse the effect of these perturbations with explainable-AI methods (SHAP and saliency) to understand shifts in attention and token importance.
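As a rough illustration of what a greedy, prefix-based search might look like, the sketch below grows a prefix of injected instructions one step at a time, keeping whichever candidate most increases a black-box similarity score. It is a sketch of the general strategy under assumed interfaces, not the authors' implementation; `toy_similarity`, the candidate pool, and the thresholds are invented for the example.

```python
# Sketch of a greedy prefix-based search for an FP trigger: grow a prefix of
# injected instructions, keeping the candidate that most increases the
# similarity score between two unrelated functions. `similarity` stands in
# for a black-box BCSD query; the candidate pool and budget are illustrative.
def greedy_fp_trigger(target_fn, decoy_fn, candidates, similarity,
                      max_insertions=20, success_threshold=0.8):
    prefix = []
    best_score = similarity(target_fn, decoy_fn + prefix)
    for _ in range(max_insertions):
        step_best = None
        for ins in candidates:
            score = similarity(target_fn, decoy_fn + prefix + [ins])
            if step_best is None or score > step_best[1]:
                step_best = (ins, score)
        if step_best is None or step_best[1] <= best_score:
            break                          # no candidate improves the match
        prefix.append(step_best[0])
        best_score = step_best[1]
        if best_score >= success_threshold:
            break                          # decoy now looks similar enough
    return prefix, best_score

# Toy usage: similarity = Jaccard overlap of instruction sets (illustrative only).
def toy_similarity(a, b):
    return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

target = ["mov rax, rdi", "xor rdx, rdx", "div rsi", "ret"]
decoy  = ["push rbp", "mov rbp, rsp", "pop rbp", "ret"]
pool   = target + ["nop", "xchg rbx, rbx"]
print(greedy_fp_trigger(target, decoy, pool, toy_similarity))
```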
Key Findings
- Model robustness is strongly shaped by the processing pipeline, comprising pre-processing, normalisation, feature selection, and architectural choices; models that incorporate CFG information, such as Gemini and Genius, or dynamic features, such as micro-traces in Trex, tend to be more robust to certain transformations, while those relying mainly on instruction-level features, such as BinShot, are more vulnerable to basic block reordering.
- A transformation budget bound limits the effectiveness of adversarial changes, permitting only a finite amount of perturbation per function, dictated by input size and the expressive capacity of equivalent instructions.
- Well-crafted adversarial transformations can achieve high attack success rates with modest perturbations; FP triggering can reach up to 100 per cent attack success rate (ASR) with, on average, around 14.75 instructions and budgets up to 100 bytes, while FN triggering is effective with smaller modifications in many models (see the sketch after this list).
- FP-triggering perturbations disrupt model decisions by distorting internal attention patterns and token importance; SHAP values concentrate near zero and saliency decreases, indicating reduced reliance on crucial features and leading to false positives.
- Transferability of FP triggers varies across models; attacks transfer more readily between architectures with similar input representations, while CFG- or graph-based models show greater resilience; Trex shows notable transfer in some comparisons.
- LLVM-based obfuscation generally degrades performance across models; however, BinShot demonstrates relative robustness to obfuscation, suggesting different models rely on different features; graph-based models show particular sensitivity to CFG perturbations.
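The sketch below illustrates the budget-bound finding: attack success rate (ASR) is computed only over variants that stay within a model-specific perturbation budget. The budget numbers, model names, and record format are illustrative assumptions, not figures from the paper.

```python
# Sketch: attack success rate (ASR) under per-model perturbation budgets.
# Each record describes one adversarial variant; the budgets mirror the idea
# that a model's input limits cap useful perturbation. Numbers and field
# names are illustrative.
BUDGETS = {"SAFE": 150, "BinShot": 128, "Gemini": 250}   # max added bytes

variants = [
    {"model": "SAFE",    "added_bytes": 90,  "flipped": True},
    {"model": "SAFE",    "added_bytes": 180, "flipped": True},   # over budget, excluded
    {"model": "BinShot", "added_bytes": 40,  "flipped": True},
    {"model": "Gemini",  "added_bytes": 60,  "flipped": False},
]

def asr(records, model, budgets=BUDGETS):
    """ASR counted only over variants within the model's budget."""
    valid = [r for r in records
             if r["model"] == model and r["added_bytes"] <= budgets[model]]
    if not valid:
        return 0.0
    return sum(r["flipped"] for r in valid) / len(valid)

for m in BUDGETS:
    print(m, asr(variants, m))
```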
Limitations
Limitations include the lack of formal verification of in-place transformations and non-deterministic variant generation, which can influence robustness estimates. The study focuses on 64-bit ELF binaries and a selected set of transformations and models; some transformations fail within time limits or due to tool compatibility; the work does not exhaustively cover all obfuscation techniques nor advanced protections such as encryption or packing; XAI fidelity analyses are exploratory and not conclusive.
Why It Matters
The findings show that ML-based BCSD detectors can be fooled by semantics-preserving changes, enabling evasion of malware detection or misattribution, particularly in ML-as-a-service contexts. Suggested defences include adversarial training and data augmentation with semantics-preserving transforms, combining static and dynamic features, and integrating multiple analysis approaches to reduce reliance on a single representation. The work provides a methodology and dataset (asmFooler) for systematic robustness testing of BCSD models and highlights the importance of pipeline design in building resilient, security-aware machine learning systems.
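As a closing illustration of the data-augmentation defence, the sketch below expands a training corpus of functions with semantics-preserving variants labelled as similar to their originals, so a model can learn invariance to such edits. The transform callables and pair format are assumptions standing in for real rewriting passes.

```python
# Sketch: data augmentation for robust BCSD training. Each original function
# is paired with semantics-preserving variants and labelled "similar", so the
# model learns invariance to these edits. The lambdas below are placeholders
# for real rewriters (NOP insertion, block reordering, obfuscation passes).
def augment_pairs(functions, transforms):
    """functions: list of instruction lists; transforms: callables returning
    a semantics-preserving variant. Yields (anchor, variant, label) triples."""
    for fn in functions:
        for t in transforms:
            yield fn, t(fn), 1          # label 1 = same underlying function

# Placeholder transforms: insert behaviour-neutral filler at different points.
pad_prologue = lambda fn: ["push rax", "pop rax"] + fn         # filler at entry
pad_body     = lambda fn: fn[:1] + ["xchg rbx, rbx"] + fn[1:]  # filler mid-function

corpus = [["mov rax, rdi", "add rax, rsi", "ret"]]
for anchor, variant, label in augment_pairs(corpus, [pad_prologue, pad_body]):
    print(label, variant)
```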