VoiceKey: Proving You’re Human
Or — why the voice-deepfake arms race is unwinnable, and what we’re building instead
Voice cloning crossed the realism threshold quietly. ElevenLabs, the major proprietary voice models, and a half-dozen open-source clones now produce synthetic speech that passes casual human review and increasingly passes automated detectors. The standard industry response has been to train ever-larger classifiers on ever-larger corpora of synthetic audio. This is an arms race, and structurally it’s one that defenders cannot win: every new generative model release resets the detector’s training distribution and the cycle begins again.
VoiceKey takes the opposite approach. Instead of detecting AI, we detect the absence of human-only signatures.
This post walks through the thesis, the working POC code, the broader system architecture, and where to start in the repo if you want to engage with the work.
The negative detection thesis
A human voice is not a clean signal. It’s the output of a wet, chaotic, quantum-perturbed biological system. Specifically:
Physiological micro-tremors. The larynx and vocal tract aren’t stable platforms. Muscles fasciculate. Tissues vibrate against each other. Glottal pulses vary cycle-to-cycle in ways shaped by blood flow, breath, neural firing patterns, fatigue, and emotion.
Quantum-level randomness. Biological processes operate at scales where genuine quantum indeterminacy enters the signal. This is not pseudorandom. It’s not seeded. It cannot be replayed.
Non-linear dynamics. The human vocal apparatus is a chaotic system in the mathematical sense — small perturbations produce non-linear outputs. The interaction between airflow, vocal-cord tension, tract resonance, and articulation produces signals with measurable chaotic structure.
AI speech synthesis, by contrast, is the output of a deterministic computational process. Even with sampling temperature and added noise, the underlying generator is a smooth function approximator trained to produce plausible-sounding audio. It cannot produce true biological complexity any more than a sine wave can produce a thunderstorm.
The bet underneath VoiceKey is that this gap is detectable and, more importantly, persistent. New generative models can get better at approximating human-sounding output. They cannot become biology.
The POC: fractal voice analysis (working code, MIT-licensed)
We released a proof-of-concept implementation that validates the thesis empirically. The repo is VoiceKey-Fractal-Detection, MIT-licensed, ~200 lines of Python.
The approach is straightforward:
Load the first 60 seconds of a .wav file
Slide a window across the signal at multiple time scales (1s and 3s in the current build)
For each window, compute two well-established complexity measures:
Higuchi Fractal Dimension (HFD): estimates the fractal dimension of the waveform across time scales, from about 1 for a smooth curve to about 2 for a signal as rough as white noise. In this POC we read a higher HFD as more pervasive, self-similar fractal patterning and a lower HFD as less self-similar variability.
Detrended Fluctuation Analysis (DFA): measures long-range correlations after removing local trends. The scaling exponent sits near 0.5 for uncorrelated noise; higher values indicate stronger long-term structural correlation, and values closer to 0.5 indicate more independent fluctuations.
Determine adaptive thresholds from the distribution observed in the sample
Classify each window and aggregate
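The windowing and scoring steps above can be sketched in plain Python. This is a minimal illustration, not the code in analyze.py: the function names, the kmax=8 setting, the DFA scales, and the choice of non-overlapping windows are all assumptions made for the example, and the signal is assumed to be already loaded as a list of floats. It uses only the standard library so the arithmetic stays visible; a real build would likely use NumPy for speed.

```python
import math


def _slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def higuchi_fd(x, kmax=8):
    """Higuchi fractal dimension: slope of log curve length vs log(1/k).

    kmax=8 is an assumed setting. A constant signal has zero curve
    length and would fail at the logarithm.
    """
    n = len(x)
    log_inv_k, log_len = [], []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):  # one sub-sampled curve per starting offset
            steps = (n - 1 - m) // k
            if steps < 1:
                continue
            dist = sum(abs(x[m + i * k] - x[m + (i - 1) * k])
                       for i in range(1, steps + 1))
            # Higuchi's normalisation for the number of points used
            lengths.append(dist * (n - 1) / (steps * k) / k)
        log_inv_k.append(math.log(1.0 / k))
        log_len.append(math.log(sum(lengths) / len(lengths)))
    return _slope(log_inv_k, log_len)


def dfa_alpha(x, scales=(16, 32, 64, 128)):
    """DFA scaling exponent with linear detrending (scales are assumed).

    The signal must be longer than the largest scale.
    """
    mean = sum(x) / len(x)
    profile, total = [], 0.0
    for value in x:  # integrate the mean-removed signal
        total += value - mean
        profile.append(total)
    log_s, log_f = [], []
    for s in scales:
        t = list(range(s))
        sq_sum, count = 0.0, 0
        for start in range(0, len(profile) - s + 1, s):
            seg = profile[start:start + s]
            b = _slope(t, seg)                    # local linear trend
            a = sum(seg) / s - b * (s - 1) / 2.0  # intercept
            sq_sum += sum((y - (a + b * ti)) ** 2 for ti, y in zip(t, seg))
            count += s
        log_s.append(math.log(s))
        log_f.append(0.5 * math.log(sq_sum / count))  # log of RMS fluctuation
    return _slope(log_s, log_f)


def window_scores(signal, sample_rate, window_seconds):
    """(HFD, DFA) per window; non-overlapping windows are an assumption."""
    size = int(sample_rate * window_seconds)
    scores = []
    for start in range(0, len(signal) - size + 1, size):
        chunk = signal[start:start + size]
        scores.append((higuchi_fd(chunk), dfa_alpha(chunk)))
    return scores
```

To mirror the two time scales in the current build, call window_scores once with window_seconds=1 and once with window_seconds=3. As a sanity check on any implementation, a straight ramp should give an HFD of 1, and white noise should land near an HFD of 2 and a DFA exponent of 0.5.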
When run against my own voice and against ElevenLabs renderings of the same script, the result is consistent and counterintuitive to most people’s first guess: AI-generated voices score higher on both HFD and DFA. Synthetic audio is more fractally structured, more long-range-correlated, and more pattern-rich than human voice. Human voice is messier, less self-similar, less predictable.
This is the negative detection signal in concrete numerical form. We aren’t finding what AI added. We’re finding what biology contributes that AI can’t fake — namely, the lack of artificial structure.
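The thresholding and aggregation steps might look something like the sketch below. The post does not spell out how the adaptive thresholds are derived, so the rule here (the median of each measure over the pooled windows of the recordings being compared) is purely an assumption, as are the function names and the majority cut-off. The only thing taken from our results is the direction: windows scoring higher on both HFD and DFA are treated as synthetic-looking.

```python
from statistics import median


def adaptive_thresholds(scores):
    """Hypothetical rule: median of each measure over all pooled windows.

    `scores` is a list of (hfd, dfa) pairs, one per window.
    """
    hfd_threshold = median(hfd for hfd, _ in scores)
    dfa_threshold = median(dfa for _, dfa in scores)
    return hfd_threshold, dfa_threshold


def synthetic_fraction(scores, hfd_threshold, dfa_threshold):
    """Share of windows that exceed both thresholds."""
    flagged = sum(1 for hfd, dfa in scores
                  if hfd > hfd_threshold and dfa > dfa_threshold)
    return flagged / len(scores)


def classify(scores, hfd_threshold, dfa_threshold, majority=0.5):
    """Aggregate window votes; the 0.5 cut-off is an assumption."""
    share = synthetic_fraction(scores, hfd_threshold, dfa_threshold)
    return "synthetic" if share > majority else "human"


# Made-up placeholder scores for illustration only, not measurements.
recording_a = [(1.20, 0.60), (1.30, 0.70), (1.25, 0.65)]
recording_b = [(1.60, 1.00), (1.70, 1.10), (1.65, 1.05)]

hfd_t, dfa_t = adaptive_thresholds(recording_a + recording_b)
label_a = classify(recording_a, hfd_t, dfa_t)
label_b = classify(recording_b, hfd_t, dfa_t)
```

Pooling matters under this rule: a median taken from a single recording would flag roughly half of its own windows by construction, which is one reason the thresholds are listed as an open area for refinement.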
Some honest scoping: sample size is small. Detection works in our test conditions; it has not been stress-tested across diverse speakers, languages, recording environments, or against adversarial models specifically trained to add biological-looking noise. We say this explicitly in the repo. The POC exists to prove the signal exists, not to ship a production system.
The full VoiceKey stack
The fractal analysis is one layer of a larger architecture. The full design:
| Layer | Purpose | Implementation |
| --- | --- | --- |
| 1) MFA + biometric pre-screen | Reject obvious illegitimate attempts before spending compute on the heavy layer | Standard auth stack |
| 2) Analog voice capture | Force physical presence at the analyzer; capture nuance digital recording can’t reproduce | Hardware-bound recording, not file upload |
| 3) Negative detection | Identify absence of human signatures | HFD/DFA + future complexity measures |
| 4) Zero-Knowledge Proofs | Allow verification without exposing biometric data | Standard ZKP libraries |
| 5) Blockchain anchor | Immutable record of verifications without leaking content | L2 EVM-compatible chain |
The most important property of this stack is computational asymmetry. Running the verification is expensive — hours of compute per sample in the current POC. But evading it would require an attacker to produce audio that fools the fractal analyzer and survives the analog capture loop and defeats the ZKP-verified MFA layer. The defender’s cost is high but bounded. The attacker’s cost grows exponentially as you stack the layers.
This is the same kind of asymmetry that proof-of-work mining and password hashing rely on: the honest party's cost stays feasible and bounded, while the attacker's cost becomes intractable.
The README walks through the security analysis across 1, 5, 10, 50, and 100-year compute-growth horizons. Short version: the asymmetry holds even under aggressive projections of attacker compute, because each additional layer in the stack multiplies the evasion cost while only linearly adding to the verification cost.
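The linear-versus-multiplicative argument can be made concrete with a toy model. The sketch below assumes, as the argument above does, that verification costs add across layers while evasion difficulty multiplies. The function names and every number in it are invented for illustration and are not taken from the README's analysis.

```python
from math import prod


def verification_cost(layer_costs):
    """Defender runs each layer once, so per-layer costs add up."""
    return sum(layer_costs)


def evasion_cost(layer_factors):
    """Attacker must beat every layer; factors multiply under the
    assumption that the layers are independent."""
    return prod(layer_factors)


# Invented example: each layer costs the defender 1 unit and makes
# evasion 10 times harder. Neither figure is a measurement.
for layers in (1, 3, 5):
    defender = verification_cost([1] * layers)
    attacker = evasion_cost([10] * layers)
    print(f"{layers} layers: defender={defender}, attacker={attacker}")
```

The multiplication only holds if the layers really are independent. If beating one layer also weakens another, the attacker's cost grows more slowly than this model suggests.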
Why this matters
The proof-of-humanity problem stopped being academic in 2024. Voice fraud is already costing companies real money. Deepfake audio already shapes elections and corporate scams. The cost of producing convincing synthetic audio has dropped to roughly zero. Every voice authentication system trained or designed before ~2023 should be assumed compromised.
The forward options are limited:
1) Keep training bigger classifiers (loses the arms race; loses faster as generative models improve)
2) Move to non-voice biometrics (privacy regressions, accessibility regressions, infrastructure cost)
3) Bind verification to physical presence and biological signal properties that can’t be synthesized at all
We think option three is the only stable answer, and we think the cryptographic and analog primitives needed to build it already exist in production-grade form.
Where to start in the repo
If you’re a developer who wants to engage:
VoiceKey/README.md: full technical overview, MFA layer, ZKP/blockchain design, security analysis across compute-growth horizons
VoiceKey-Fractal-Detection/analyze.py: the working classifier. Run it against your own voice and an ElevenLabs clone of the same script
voicekey-test-narrative.md: the standard test script, so your samples are comparable to ours
fractal_analysis_comparison.md: empirical findings with HFD/DFA visualizations
We’re explicitly looking for:
Diverse voice samples across languages, accents, and recording conditions
Threshold refinements (the adaptive thresholds are a starting point, not a final answer)
Additional complexity measures beyond HFD and DFA
Adversarial samples — synthetic audio specifically designed to add biological-looking noise. This is the strongest possible test of the thesis, and we want it run
MIT-licensed. No CLA. No token. No funding sources to protect. Open an issue or a PR.
VoiceKey is part of the AI Integrity Alliance research program. We build open-source verification infrastructure for trustworthy AI and proof of humanity. We take no external funding. Just MIT-licensed code and the research behind it.

