Latent Geometric Chords slip past robust vision models
Decision-based black-box attacks, the kind that only see top-1 labels, used to mean ugly, high-frequency noise and a lot of guesswork. Latent Geometric Chords (LGC) changes the playbook: it keeps perturbations semantically clean, converges in a few thousand queries, and still topples adversarially trained models.
The setup is simple enough. You invert the image into an autoencoder’s latent space and search for the decision boundary with a curvature-aware, semicircular move rather than a straight push. That trims the query budget because you spend less time skidding along flat regions and more time tracing where the classifier actually flips.
The trick they almost buried
Residual-based Adversarial Generation (RAG) is the real exploit here. Instead of trusting the decoder’s reconstruction, LGC decodes both the original and the perturbed latents, takes their difference (a geometric chord), and pastes that residual directly onto the source image. Two wins: you dodge the decoder’s artefacts, and you expand the search space beyond the tight generative manifold. Under a Lipschitz generator, the paper argues these chord residuals give you up to 2k effective dimensions if the latent is k-dimensional. In practice, that means you can step off-manifold just enough to find adversarial pockets the encoder–decoder would otherwise hide from you, without blowing visual fidelity.
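The residual step is easy to sketch. The snippet below is a toy illustration of the idea as described here, not the released code: `rag_compose` and `toy_decode` are hypothetical names, the decoder is a made-up lossy map from a 2-dimensional latent to four "pixels", and clipping to a `[0, 1]` pixel range is an assumption on my part.

```python
from typing import Callable, List

Vector = List[float]


def rag_compose(
    x: Vector,
    z: Vector,
    z_pert: Vector,
    decode: Callable[[Vector], Vector],
    lo: float = 0.0,
    hi: float = 1.0,
) -> Vector:
    """Add the chord decode(z_pert) - decode(z) onto the source image x."""
    base = decode(z)         # reconstruction of the original latent
    moved = decode(z_pert)   # reconstruction of the perturbed latent
    residual = [m - b for m, b in zip(moved, base)]
    # Assumption: pixels live in [lo, hi], so clip after adding the residual.
    return [min(hi, max(lo, xi + r)) for xi, r in zip(x, residual)]


def toy_decode(z: Vector) -> Vector:
    # Hypothetical lossy decoder: it always outputs 0.1 for the last pixel,
    # so it cannot reproduce that pixel of the source image.
    return [0.5 * z[0], 0.5 * z[1], 0.25 * (z[0] + z[1]), 0.1]


x = [0.2, 0.8, 0.5, 0.9]   # source image (last pixel is outside the decoder's range)
z = [0.4, 1.6]             # latent assumed to come from inverting x
z_pert = [0.6, 1.6]        # latent after one search step

x_adv = rag_compose(x, z, z_pert, toy_decode)
decoder_only = toy_decode(z_pert)
```

Here `x_adv` keeps the source's last pixel at 0.9, while `decoder_only` flattens it to 0.1. The decoder's reconstruction error appears in both decoded images and cancels in the difference, which is why the residual can be pasted onto the source without dragging the artefacts along.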
The numbers back the pitch. On ImageNet, Places365 and CelebAMask-HQ, LGC reports SSIM above 0.99 and LPIPS below 0.01 at 5,000 queries in some settings, and it lands targeted misclassifications against ResNet-50 with far smaller L2 perturbations than prior methods, including up to a sixfold reduction in one comparison. It transfers across datasets using a single ImageNet-trained autoencoder and breaks a spread of architectures: Vision Transformers, ResNets, VGG and DenseNet. The variant LGC-H trades a bit of finesse for speed.
From an operator’s angle, this is tailor-made for endpoints that expose only labels or accept/reject signals. You get query efficiency, human-plausible edits, and a way around latent-consistency checks. Any defence counting on pixel norms, reconstruction error, or the assumption that “on-manifold checks are enough” will have a bad day.
There are caveats. Performance depends on the autoencoder backbone; a VGG16 latent behaved more coherently than ResNet-50 in their tests. The dimension-doubling story leans on Lipschitz continuity of the generator. And yes, thousands of queries may still be a non-starter against tight rate limits. But the chord residual idea is the seam to pull: it neatly sidesteps the generative bottleneck that has constrained latent attacks for years, while keeping the visuals clean enough to glide under most perceptual radars. Code is out, so the only open question is how quickly defenders adapt when “stay on the manifold” stops being a safety blanket.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Latent Geometric Chords for Query-Efficient Decision-Based Adversarial Attacks
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies query-efficient decision-based black-box adversarial attacks that must operate using only top-1 labels. Existing pixel-wise attacks create unnatural, high-frequency artefacts that do not align with human perception, while prior latent-space attacks are constrained by low-dimensional generative manifolds and suffer reconstruction errors from encoder–decoder inversion. These limitations lead to poor visual fidelity, slow convergence and failure against robust models. Addressing them matters because practical attackers often have only hard-label access and aim to produce perceptually stealthy inputs with few queries.
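For readers who want the threat model pinned down, the following is a minimal sketch of what "only top-1 labels" means in code. The wrapper class, its names and the two-class toy scorer are all hypothetical; the point is only that the attacker sees an integer label per call and that every call counts against a budget.

```python
from typing import Callable, List, Sequence


class QueryBudgetExceeded(RuntimeError):
    """Raised when the attacker asks for more labels than the budget allows."""


class HardLabelOracle:
    """Hides scores behind a top-1 label and counts every query."""

    def __init__(self, scores_fn: Callable[[Sequence[float]], List[float]], budget: int):
        self.scores_fn = scores_fn
        self.budget = budget
        self.queries = 0

    def top1(self, x: Sequence[float]) -> int:
        if self.queries >= self.budget:
            raise QueryBudgetExceeded(f"budget of {self.budget} queries used up")
        self.queries += 1
        scores = self.scores_fn(x)
        # Only the index of the highest score leaves the oracle.
        return max(range(len(scores)), key=scores.__getitem__)


def toy_scores(x: Sequence[float]) -> List[float]:
    # Hypothetical stand-in for a real model: a two-class linear scorer.
    s = sum(x)
    return [s, -s]


oracle = HardLabelOracle(toy_scores, budget=3)
labels = [oracle.top1([0.5, 0.2]), oracle.top1([-1.0, 0.1])]
```

An attack evaluated "at 5,000 queries" would correspond to `budget=5000` in this framing. A rate-limited endpoint is the same wrapper with a much smaller number.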
Approach
The authors propose Latent Geometric Chords (LGC) and a faster variant LGC-H. LGC operates in the compressed latent space of an autoencoder and carries out a curvature-aware, semicircular geometric search to trace decision boundaries efficiently. A central component, Residual-based Adversarial Generation (RAG), computes the residual between the decoded baseline latent and the decoded perturbed latent (a geometric chord) and superimposes that residual directly onto the original image, rather than using the decoder output alone. This both mitigates decoder reconstruction artefacts and expands the admissible search space: mathematically, the set of chord residuals has Hausdorff dimension at most 2k when the latent space is k-dimensional and the generator is Lipschitz. Experiments use ImageNet, Places365 and CelebAMask-HQ; target models include ViT, ResNet-50, VGG16, DenseNet and ResNet-18. Evaluation metrics include attack success rate versus queries, L2 norm, Structural Similarity Index Measure (SSIM) and LPIPS perceptual distance.
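The 2k bound can be reconstructed with a short standard argument. What follows is a sketch of the likely reasoning, not the paper's own proof, and the notation (G for the generator, Φ for the chord map, L for the Lipschitz constant, n for the image dimension) is introduced here for illustration.

```latex
% Notation is illustrative, not taken from the paper.
Let $G:\mathbb{R}^k \to \mathbb{R}^n$ be $L$-Lipschitz and define the chord map
\[
  \Phi(z, z') = G(z') - G(z), \qquad \Phi:\mathbb{R}^{2k} \to \mathbb{R}^n .
\]
For any two pairs $(z_1, z_1')$ and $(z_2, z_2')$,
\[
  \|\Phi(z_1, z_1') - \Phi(z_2, z_2')\|
  \le L\,\|z_1' - z_2'\| + L\,\|z_1 - z_2\|
  \le \sqrt{2}\,L\,\|(z_1, z_1') - (z_2, z_2')\| ,
\]
so $\Phi$ is Lipschitz. Lipschitz maps do not increase Hausdorff dimension, hence
\[
  \dim_H \Phi(\mathbb{R}^{2k}) \le \dim_H \mathbb{R}^{2k} = 2k,
  \qquad\text{whereas}\qquad
  \dim_H G(\mathbb{R}^{k}) \le k .
\]
```

Both statements are upper bounds. The argument shows that chord residuals are allowed up to twice the dimension of decoder outputs, not that any particular generator reaches it.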
Key Findings
- LGC produces high-fidelity adversarial examples with relatively few queries: reported SSIM exceeds 0.99 and LPIPS falls below 0.01 at 5,000 queries in some settings, with convergence within roughly 2,000–5,000 queries under strict perceptual constraints.
- RAG substantially reduces decoder reconstruction errors and effectively doubles the search dimensionality by working in the space of chord residuals, which lets the attack reach adversarial regions that standard latent optimisation misses; the paper proves the chord set has Hausdorff dimension at most 2k under a Lipschitz assumption on the generator.
- Compared with state-of-the-art baselines, LGC achieves far better perceptual stealth while remaining competitive or superior in L2 norm: targeted attacks on ResNet-50 show up to a sixfold reduction in perturbation magnitude versus prior methods while maintaining near-perfect structural similarity. LGC also successfully attacks adversarially trained Vision Transformers and generalises across datasets using a single ImageNet-trained autoencoder.
Limitations
The approach depends on a pre-trained autoencoder backbone, and performance varies by backbone (the VGG16 latent space gave more predictable, visually coherent results than ResNet-50). The theoretical expansion to 2k dimensions assumes Lipschitz continuity of the generator. RAG mitigates but does not eliminate reconstruction risks, and the method still requires thousands of queries, which may be impractical in some deployed settings.
Implications
Offensive security implications are clear: attackers with only hard-label access can craft highly stealthy, semantically grounded adversarial inputs that preserve human-perceived image structure and require relatively few queries to succeed, including against adversarially trained models. Defences that rely on low-dimensional latent checks or reconstruction errors, or that focus solely on pixel-level perturbation norms, may not detect or prevent such attacks. The authors release code, enabling reproducible assessment and threat modelling of vision systems against this stealthier class of decision-based attacks.