SONAR

Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

ICML 2026 (regular track)

Ido Nitzan Hidekel · Gal Lifshitz · Khen Cohen · Dan Raviv

Tel Aviv University — Schools of Electrical Engineering and Physics & Astronomy

GitHub · Live demo (coming soon) · Paper (PDF)

TL;DR

Modern speech-synthesis systems leave subtle high-frequency (HF) artifacts that frequency-agnostic detectors ignore — a manifestation of spectral bias. SONAR is a dual-path detector that fuses an XLSR content branch with a parallel branch driven by learnable, value-constrained SRM high-pass filters, and trains them with a Jensen–Shannon alignment loss that pulls genuine LF/HF representations together while pushing fake ones apart. SONAR achieves state-of-the-art performance on ASVspoof 2021 and In-the-Wild, converges 4× faster than strong baselines, and degrades gracefully under codec and bandwidth shifts.

Motivation

Generative voice cloning is now cheap, fast, and convincing. The 2024–2025 election cycle brought a wave of political audio deepfakes; voice-cloning fraud has caused multi-million-dollar losses at corporations and call centers. Yet most existing detectors collapse the moment they see a generator they were not trained on — a generalization failure rather than a capacity one.

The pattern they share: networks fit low-frequency structure first and under-utilize the subtle high-frequency residuals where vocoders leave their fingerprints (the frequency principle). Real and synthetic speech differ not only in marginal HF energy, but in the joint LF–HF consistency of the signal — a relationship that frequency-agnostic detectors systematically miss. SONAR is built around this observation: instead of treating HF residuals as auxiliary features or aggregating them late, we couple them to semantic content during training itself.

Architecture

[Image: SONAR dual-path architecture: content branch + noise branch + cross-attention + AASIST classifier]
Figure 1 — Dual-path overview. Audio is processed in parallel by a content branch (XLSR encoder on the raw waveform) and a noise branch (learnable SRM high-pass filters → XLSR encoder). Their embeddings are fused via cross-attention (CA) and classified by AASIST. During training, a Jensen–Shannon alignment loss pulls the two embeddings together for genuine speech and pushes them apart for deepfakes.
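
To make the alignment objective concrete, here is a minimal PyTorch sketch of a Jensen–Shannon alignment loss in the spirit of Figure 1. The temperature softmax that turns embeddings into distributions, and the hinge margin on the fake term, are illustrative assumptions on our part, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Jensen-Shannon divergence between batches of distributions, shape (B, D).
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(dim=-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)           # bounded above by log(2)

def js_alignment_loss(z_content: torch.Tensor, z_noise: torch.Tensor,
                      is_real: torch.Tensor, tau: float = 0.1,
                      margin: float = 0.5) -> torch.Tensor:
    # is_real: float tensor (B,), 1.0 for genuine speech, 0.0 for deepfakes.
    # Turn each embedding into a distribution (assumption: temperature softmax).
    p = F.softmax(z_content / tau, dim=-1)
    q = F.softmax(z_noise / tau, dim=-1)
    jsd = js_divergence(p, q)              # (B,)
    # Genuine speech: pull the LF/HF representations together (minimize JSD).
    # Deepfakes: push them apart, up to a margin (hinge keeps the loss bounded).
    return (is_real * jsd + (1.0 - is_real) * F.relu(margin - jsd)).mean()

Since the JSD is bounded by log 2 ≈ 0.693, a margin around 0.5 is a plausible default; the actual value would be a tuned hyperparameter.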
[Image: Rich Feature Extractor — bank of constrained SRM high-pass filters]
Figure 2 — Rich Feature Extractor (RFE). M learnable SRM filters of length 5; the central tap is fixed to −1 and a zero-sum constraint is hard-projected after every optimiser step, so each filter stays strictly high-pass throughout training. A 1×1 convolution then aggregates their outputs into the noise branch's input.
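
As a concrete illustration, here is a PyTorch sketch of an RFE obeying the constraints the caption describes. The module and method names, the default number of filters, and the shift-based projection are our choices, not necessarily the paper's implementation.

import torch
import torch.nn as nn

class RichFeatureExtractor(nn.Module):
    """M learnable length-5 high-pass filters with a fixed central tap,
    followed by a 1x1 convolution aggregating the M residual channels."""

    def __init__(self, num_filters: int = 8, kernel_size: int = 5):
        super().__init__()
        self.hp = nn.Conv1d(1, num_filters, kernel_size,
                            padding=kernel_size // 2, bias=False)
        self.mix = nn.Conv1d(num_filters, 1, kernel_size=1, bias=False)
        self.project()  # start from a valid high-pass configuration

    @torch.no_grad()
    def project(self) -> None:
        # Hard projection (call after every optimizer step): fix the central
        # tap to -1 and shift the side taps so they sum to +1, making each
        # filter zero-sum (zero response at DC, i.e. strictly high-pass).
        w = self.hp.weight.data            # shape (M, 1, K)
        k = w.shape[-1] // 2
        w[:, :, k] = 0.0                   # exclude the central tap from the shift
        side_sum = w.sum(dim=-1, keepdim=True)
        w -= (side_sum - 1.0) / (w.shape[-1] - 1)
        w[:, :, k] = -1.0                  # restore the fixed central tap

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, T) raw waveform -> (B, 1, T) aggregated HF residual.
        return self.mix(self.hp(x))

Calling project() immediately after every optimizer.step() is what "hard-projected" means here: the constraint is enforced by overwriting the weights rather than by a penalty term, so the filters remain strictly high-pass at all times.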

Headline results

EER (%, lower is better) under a strict single-run protocol — no checkpoint averaging, no ensembling. The best result in each column is marked with an asterisk (*).

Model                       DF EER ↓    LA EER ↓    In-the-Wild EER ↓
XLSR + AASIST (baseline)      3.69        1.90        10.46
XLSR-Mamba                    1.88       *0.93         6.71
SONAR-Full                    1.57        1.55         6.00
SONAR-Finetune               *1.45        1.20        *5.43

XLSR-Mamba retains the lead on the in-domain LA partition (its published numbers were obtained with checkpoint averaging over multiple seeds); SONAR is strongest where it matters most for deployment — out-of-distribution detection (DF and In-the-Wild).
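
For reference, EER is the operating point at which the false-positive and false-negative rates coincide. A minimal computation from raw detection scores (our helper, built on scikit-learn's ROC):

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the ROC point where the false-positive rate equals the
    false-negative rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fpr - fnr))
    return 0.5 * (fpr[i] + fnr[i])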

Key findings

Where the HF signal lives inside SONAR

SONAR's central claim is that audio deepfake detectors miss high-frequency artifacts because the encoder itself is HF-blind. To test that claim directly — and to ask whether SONAR's training fixes the blindness — we run a simple probing experiment on the trained encoders.

Setup. We freeze the encoder, take each clip, high-pass filter it at fc = 4 kHz (everything below 4 kHz removed), feed it through the encoder, and mean-pool the resulting sequence into a single fixed-length embedding. We then train a small linear classifier on those embeddings to distinguish real from fake on the In-the-Wild test set. If the encoder genuinely carries HF cues, the linear probe will discriminate well; if the encoder discards HF information, the probe will be near chance.
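
A compact sketch of this probe pipeline, assuming a zero-phase Butterworth high-pass at 4 kHz, a frozen encoder that maps a waveform batch to a (1, T, D) feature sequence (a placeholder API), and a scikit-learn logistic regression as the linear probe:

import numpy as np
import torch
from scipy.signal import butter, sosfiltfilt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def highpass(wav: np.ndarray, sr: int = 16000, fc: float = 4000.0) -> np.ndarray:
    # Zero-phase Butterworth high-pass: everything below fc is removed.
    sos = butter(8, fc, btype="highpass", fs=sr, output="sos")
    return sosfiltfilt(sos, wav).copy()

@torch.no_grad()
def embed(encoder: torch.nn.Module, wav: np.ndarray) -> np.ndarray:
    # Frozen encoder -> (1, T, D) feature sequence -> mean-pooled (D,) embedding.
    feats = encoder(torch.from_numpy(wav).float()[None])   # placeholder API
    return feats.mean(dim=1).squeeze(0).cpu().numpy()

def run_probe(encoder, train_wavs, train_labels, test_wavs, test_labels) -> float:
    X_tr = np.stack([embed(encoder, highpass(w)) for w in train_wavs])
    X_te = np.stack([embed(encoder, highpass(w)) for w in test_wavs])
    probe = LogisticRegression(max_iter=1000).fit(X_tr, train_labels)
    fpr, tpr, _ = roc_curve(test_labels, probe.decision_function(X_te))
    i = np.nanargmin(np.abs(fpr - (1.0 - tpr)))
    return 0.5 * (fpr[i] + (1.0 - tpr[i]))   # probe EER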

What we vary. We compare three encoders and sweep how much training data the probe sees (10%, 25%, 50%, 100% of the dev set). More data should help if the HF signal exists in the embedding but is faint and hard to extract from few samples.
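
The sweep itself is just subsampling the probe's training set. Building on run_probe from the sketch above (the single fixed shuffle, so smaller fractions are nested subsets of larger ones, is our assumption):

def sweep_fractions(encoder, dev_wavs, dev_labels, test_wavs, test_labels,
                    fractions=(0.10, 0.25, 0.50, 1.00), seed=0):
    # One fixed permutation; each fraction takes a prefix of it.
    order = np.random.default_rng(seed).permutation(len(dev_wavs))
    results = {}
    for f in fractions:
        idx = order[: int(round(f * len(dev_wavs)))]
        results[f] = run_probe(encoder,
                               [dev_wavs[i] for i in idx],
                               [dev_labels[i] for i in idx],
                               test_wavs, test_labels)
    return results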

[Image: HF-probe EER vs. training fraction for three encoders: baseline XLSR (red), SONAR content branch (green), SONAR noise branch (blue)]
Figure 3 — HF-probe EER as a function of training-data fraction (fc = 4 kHz). Lower is better. Each point is the equal-error rate of a linear probe trained on the (frozen) encoder's mean-pooled embeddings of high-pass-filtered audio. We sweep the training-data fraction from 10% to 100%. The grey dotted line is the corresponding LF probe (low-pass) on the baseline — it saturates immediately, showing LF cues are trivially separable.

Two findings. First, the HF probes on the baseline XLSR encoder and on SONAR's content branch stay near chance at every training fraction; more probe data does not help, which indicates the HF information is absent from those embeddings rather than merely faint. Second, the HF probe on SONAR's noise branch separates real from fake decisively: the cues the content encoder discards are preserved there.

Reading these together. SONAR doesn't fix HF blindness inside the content encoder — it compensates for it architecturally, by allocating a parallel encoder to the HF residual and tying its distribution to the content distribution through the Jensen–Shannon alignment loss. The HF discriminative load that the baseline cannot carry ends up in SONAR's noise branch.

Citation

@inproceedings{hidekel2026sonar,
  title     = {{SONAR}: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection},
  author    = {Hidekel, Ido Nitzan and Lifshitz, Gal and Cohen, Khen and Raviv, Dan},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}