
Local LLM speeds x86 reverse engineering with REx86

Defenses
Published: Fri, Oct 24, 2025 • By Dr. Marcus Halden
Researchers fine-tune local, open-weight Large Language Models (LLMs) to help with x86 reverse engineering in air-gapped and privacy-sensitive environments. The top model, REx86, cuts test-set cross-entropy loss by 64.2% and raises semantic similarity to ground-truth comments by 20.3%. A limited user study shows better line-level understanding and faster analyst workflows, with caveats.

Reverse engineering of x86 binaries is a slow, detail-heavy task. Compilation strips names and structure, and deliberate obfuscation slows analysts further. Cloud-hosted models can help, but they carry data-leak and policy risks and cannot run inside air-gapped facilities. This study asks whether local, open-weight LLMs can offer useful, privacy-preserving assistance for disassembly and commentary.

Approach and results

The authors fine-tune eight open-weight models from the CodeLlama, Qwen2.5-Coder, and CodeGemma families on a curated set of 5,981 x86 assembly examples. Fine-tuning uses parameter-efficient adapters (LoRA) and 4-bit quantisation so the resulting models can run on consumer-class hardware. Training ran for three epochs on a single 48 GB GPU. Among the tuned variants, the Qwen2.5-Coder-7B model performed best and is released as REx86.
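The core idea behind LoRA is that the frozen base weights get a trainable low-rank update rather than a full-matrix update. The sketch below is illustrative only: the dimensions, rank and scaling are plausible values (rank 8 with a 2:1 alpha ratio), not the paper's actual training code.

```python
import numpy as np

# LoRA sketch: instead of updating the full weight matrix W (d_out x d_in),
# train two small matrices A (r x d_in) and B (d_out x r) with rank r << d.
# The adapted layer computes: y = (W + (alpha / r) * B @ A) @ x
d_out, d_in, r, alpha = 64, 64, 8, 16   # rank 8, 2:1 alpha-to-rank ratio

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen base weights
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init

x = rng.standard_normal(d_in)
delta = (alpha / r) * B @ A
y = (W + delta) @ x

# With B initialised to zero the adapter contributes nothing at first,
# so the adapted model reproduces the base model exactly.
assert np.allclose(y, W @ x)

# Parameter savings: a full update trains d_out * d_in weights,
# LoRA trains only r * (d_in + d_out).
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(full_params, lora_params)
```

The savings scale with model size: at 7B parameters, training only low-rank adapters (plus 4-bit quantisation of the frozen weights) is what lets the whole run fit on a single 48 GB GPU.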

Quantitative gains are clear: REx86 cuts test-set cross-entropy loss by 64.2% relative to its base model and improves semantic cosine similarity to ground-truth comments by 20.3%. In a modest human study with 43 participants, use of REx86 produced a statistically significant improvement in line-level code understanding (p = 0.031) and increased the correct-solve rate from 31% to 53%, though that change did not reach statistical significance (p = 0.189). Qualitative checks found REx86 produces shorter, more accurate comments and fewer hallucinations than the untuned base model.
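For context, cross-entropy loss is the average negative log-probability a model assigns to each correct next token, so a 64.2% reduction means the tuned model is substantially more confident on held-out disassembly. A small illustration with made-up per-token probabilities (not the paper's data):

```python
import math

def cross_entropy(p_correct: list[float]) -> float:
    """Mean negative log-likelihood of the ground-truth tokens."""
    return -sum(math.log(p) for p in p_correct) / len(p_correct)

# Hypothetical probabilities a base model and a fine-tuned model
# assign to the correct next token at four positions.
base = cross_entropy([0.10, 0.05, 0.20, 0.08])
tuned = cross_entropy([0.55, 0.40, 0.70, 0.50])

reduction = (base - tuned) / base  # relative loss reduction
print(f"loss {base:.2f} -> {tuned:.2f} ({reduction:.0%} lower)")
```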

Implications and limits

The practical takeaway is straightforward: domain-specific fine-tuning can make local LLMs genuinely helpful for reverse engineering tasks while keeping sensitive binaries off the network. That matters for incident response in critical sectors and for teams that cannot use cloud services. At the same time there are real limits: the training set is relatively small, the user evaluation covers a single specimen and a limited cohort, and the authors note signs of overfitting after two epochs. The work targets x86 only, so other architectures will need fresh data and adapters.

There are also dual-use concerns. Better offline tools lower the barrier for anyone wanting to analyse proprietary or protected firmware. The paper acknowledges this tension and points to sensible mitigations: careful provenance of training data, human-in-the-loop verification, and access controls for the model and dataset.

Operational takeaways:

  • Run local, fine-tuned models for sensitive reverse engineering to avoid cloud exposure, but treat outputs as analyst aids, not ground truth.
  • Validate model suggestions with standard static and dynamic checks; do not automate remediation decisions solely on LLM commentary.
  • Keep training and adapter provenance documented and restrict access to reduce misuse risk.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

REx86: A Local Large Language Model for Assisting in x86 Assembly Reverse Engineering

Authors: Darrin Lea, James Ghawaly, Golden Richard III, Aisha Ali-Gombe, and Andrew Case
Reverse engineering (RE) of x86 binaries is indispensable for malware and firmware analysis, but remains slow due to stripped metadata and adversarial obfuscation. Large Language Models (LLMs) offer potential for improving RE efficiency through automated comprehension and commenting, but cloud-hosted, closed-weight models pose privacy and security risks and cannot be used in closed-network facilities. We evaluate parameter-efficient fine-tuned local LLMs for assisting with x86 RE tasks in these settings. Eight open-weight models across the CodeLlama, Qwen2.5-Coder, and CodeGemma series are fine-tuned on a custom curated dataset of 5,981 x86 assembly examples. We evaluate them quantitatively and identify the fine-tuned Qwen2.5-Coder-7B as the top performer, which we name REx86. REx86 reduces test-set cross-entropy loss by 64.2% and improves semantic cosine similarity against ground truth by 20.3% over its base model. In a limited user case study (n=43), REx86 significantly enhanced line-level code understanding (p = 0.031) and increased the correct-solve rate from 31% to 53% (p = 0.189), though the latter did not reach statistical significance. Qualitative analysis shows more accurate, concise comments with fewer hallucinations. REx86 delivers state-of-the-art assistance in x86 RE among local, open-weight LLMs. Our findings demonstrate the value of domain-specific fine-tuning, and highlight the need for more commented disassembly data to further enhance LLM performance in RE. REx86, its dataset, and LoRA adapters are publicly available at https://github.com/dlea8/REx86 and https://zenodo.org/records/15420461.

🔍 ShortSpan Analysis of the Paper

Problem

Reverse engineering of x86 binaries is essential for malware and firmware analysis but remains slow because compilation removes metadata and obfuscation can hinder comprehension. Large Language Models offer potential to accelerate reverse engineering through automated understanding and commentary, yet cloud-based, closed-weight models pose privacy and security risks and cannot operate in closed networks. This work evaluates parameter-efficient, fine-tuned local language models for assisting with x86 reverse engineering in environments where offline operation is required.

Approach

Eight open-weight models across the CodeLlama, Qwen2.5-Coder, and CodeGemma series were fine-tuned on a curated dataset of 5,981 x86 assembly examples. Fine-tuning used Unsloth, a framework for parameter-efficient fine-tuning with LoRA and 4-bit quantization, enabling operation on consumer hardware. Models ranged from 3B to 34B parameters and were tuned with LoRA ranks of 8, 16 and 32 and alpha-to-rank ratios of 1:1 and 2:1. Training occurred on a single Nvidia RTX 6000 GPU with 48 GB VRAM for three epochs. Evaluation used cross-entropy loss and semantic cosine similarity, with embeddings generated by NV-Embed-v2. The fine-tuned Qwen2.5-Coder-7B model emerged as the best performer and was named REx86. The dataset and LoRA adapters are publicly available.
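The semantic-similarity metric can be sketched as follows. The vectors below are placeholders standing in for embeddings of a model-generated comment and the ground-truth comment; the paper produces real embeddings with NV-Embed-v2, which this sketch does not reproduce.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for embeddings of a generated
# comment and its ground-truth counterpart.
generated = np.array([0.2, 0.8, 0.1, 0.5])
reference = np.array([0.25, 0.75, 0.0, 0.55])

score = cosine_similarity(generated, reference)
print(f"semantic similarity: {score:.3f}")
```

Because cosine similarity ignores vector magnitude, it compares the direction of the embeddings, i.e. what the comments say rather than how long they are, which suits comparing terse model comments against ground truth.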

Key Findings

  • REx86 reduces test-set cross-entropy loss by 64.2 per cent and improves semantic cosine similarity against ground truth by 20.3 per cent compared with its base model.
  • In a human case study with 43 participants, REx86 significantly enhanced line-level code understanding (p = 0.031) and raised the correct-solve rate from 31 per cent to 53 per cent (p = 0.189), though the latter did not reach statistical significance.
  • Qualitative analysis shows more accurate and concise comments with fewer hallucinations when using REx86 compared with the base model.
  • REx86 represents the state of the art for x86 reverse-engineering assistance among local, open-weight models. Five of the eight fine-tuned models showed meaningful gains in cosine similarity across tasks; the strongest overall results came from Qwen2.5-Coder-7B and CodeLlama-7B.
  • The best-performing model for downstream evaluation was the fine-tuned Qwen2.5-Coder-7B, designated REx86, which was used in the subsequent human study and qualitative assessment.

Limitations

The human evaluation utilised a teaching lab with a single malware specimen and a modest sample size, which limits generalisability. The fine-tuning dataset, while substantial for this study, remains relatively small due to limited availability of commented disassembly data. Architectures beyond x86 were not addressed and would require additional data and potentially architecture-specific adapters. Overfitting tendencies were observed after two training epochs, suggesting cautious selection of training duration in future work.

Why It Matters

The work demonstrates that a local, open-weight LLM can provide meaningful assistance for reverse engineering while preserving data privacy and remaining operable offline in air-gapped or tightly controlled environments. The approach reduces reliance on cloud-based services, enhances analyst understanding at the line level, and can speed incident response in critical sectors. However, the study also highlights dual-use risks: offline RE capabilities could expedite analysis of protected or obfuscated code by malicious actors. Mitigations include careful evaluation of reliability, provenance of training data, and guardrails in disassembly workflows to balance benefits for defenders against potential misuse. The results underscore the value of domain-specific fine-tuning and the need for more commented disassembly data to further improve RE performance.

