we spend SO much time on this forum (and everywhere else) debating AI images but honestly the audio side is what keeps me up at night
saw a demo last week where someone cloned a CEO’s voice from a 30 second earnings call clip and then generated a completely fake phone call authorizing a wire transfer. the whole thing took less than 5 minutes to set up. five. minutes.
there are some detection tools coming (Resemble AI Detect, Pindrop) but they're all enterprise-priced. for regular people there's basically nothing.
and the detection challenge is even harder than images because phone-call and voice-message compression literally destroys the artifacts that detectors look for. so even if you had a good detector it probably wouldn't work on most real-world audio.
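the compression point is easy to show on synthetic audio. rough sketch below — the 6 kHz "artifact" is a made-up stand-in for a synthesis artifact, and the channel is an idealized low-pass + decimate, not a real phone codec:

```python
import numpy as np

fs = 16_000                      # typical "studio" sample rate
t = np.arange(fs) / fs           # one second of audio
# stand-in signals: a 1 kHz "voice" tone plus a synthetic 6 kHz
# component playing the role of a synthesis artifact
voice = np.sin(2 * np.pi * 1_000 * t)
artifact = 0.3 * np.sin(2 * np.pi * 6_000 * t)
x = voice + artifact

def band_energy(sig, fs, lo, hi):
    """Total spectral energy between lo and hi Hz."""
    spec = np.abs(np.fft.rfft(sig)) ** 2
    freqs = np.fft.rfftfreq(len(sig), 1 / fs)
    return spec[(freqs >= lo) & (freqs <= hi)].sum()

# simulate the telephone channel: ideal low-pass at 4 kHz,
# then 2:1 decimation down to the 8 kHz narrowband rate
spec = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), 1 / fs)
spec[freqs > 4_000] = 0
phone = np.fft.irfft(spec)[::2]  # now sampled at 8 kHz

print(band_energy(x, 16_000, 5_500, 6_500))     # artifact clearly present
print(band_energy(phone, 8_000, 3_000, 4_000))  # nothing left up there to detect
```

the voice survives the channel fine — everything above 4 kHz, where the artifact lived, simply doesn't exist anymore at 8 kHz.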
voiceover artists, podcasters, musicians — y'all paying attention to this? how are you thinking about protecting your voice?
Voiceover artist here, been in the industry 12 years. Yes we’re paying attention and yes we’re terrified.
I already found my voice on an AI voice marketplace last year. Someone had cloned it from my demo reel on my website. Took me 4 months and a lawyer’s letter to get it taken down, and I’m still not sure they actually deleted the model.
There’s basically no legal framework for voice cloning yet. A few states have laws but enforcement is a joke.
The wire transfer scenario is real and already happening btw — there was a case in early 2024 where a Hong Kong company lost $25 million to a deepfake video call where the “CFO” authorized a transfer. That wasn't even audio-only, it was a full video deepfake on a Zoom call with multiple cloned participants.
And yeah the enterprise detection tools cost $$$. For consumers there’s nothing, and I don’t see that changing anytime soon because the consumer market can't support the R&D costs.
One thing that might help: establishing verbal authentication protocols. Like a family safe word for phone calls. Sounds paranoid now but probably won't in 2 years.
I make music and this terrifies me too. That AI-generated fake-Drake track showed that you can clone a recognizable artist's voice and it'll go viral before anyone does anything about it.
What bugs me is the asymmetry. Creating a voice clone: minutes. Detecting one: expensive specialized tools. Getting one removed: months of legal back and forth. The incentives are completely broken.
From a technical perspective, audio deepfake detection is harder than image detection for a few reasons:
- Audio compression is lossy and aggressive — phone audio is sampled at 8 kHz, while most detectors expect 16 kHz or higher
- Background noise masks artifacts
- Voice is inherently variable (you sound different when tired, sick, emotional) so “normal” has a wide range
- Real-time detection is computationally expensive
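The noise-masking point in particular is easy to see on synthetic data. A toy sketch — the narrow 5 kHz tone here is an assumed stand-in for a synthesis artifact, not a feature any real detector uses:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16_000
t = np.arange(fs) / fs
# "voice" at 300 Hz plus a faint 5 kHz tone standing in for a
# synthesis artifact, 40 dB below the voice
voice = np.sin(2 * np.pi * 300 * t)
artifact = 0.01 * np.sin(2 * np.pi * 5_000 * t)

def artifact_snr_db(sig):
    """Energy in the artifact band vs the surrounding bands, in dB."""
    spec = np.abs(np.fft.rfft(sig)) ** 2
    freqs = np.fft.rfftfreq(len(sig), 1 / fs)
    band = spec[(freqs > 4_900) & (freqs < 5_100)].sum()
    rest = (spec[(freqs > 4_000) & (freqs <= 4_900)].sum()
            + spec[(freqs >= 5_100) & (freqs < 6_000)].sum())
    return 10 * np.log10(band / (rest + 1e-12))

clean = voice + artifact
noisy = clean + 0.05 * rng.standard_normal(len(t))  # mild background noise

print(artifact_snr_db(clean))  # strongly positive: tone sticks out
print(artifact_snr_db(noisy))  # negative: tone buried in the noise floor
```

Even fairly mild background noise pushes the artifact below the noise floor, and that's before any compression touches the signal.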
The most promising approach I’ve seen is challenge-response: the detector asks the speaker to say something specific in real-time. Current voice cloning can’t handle truly arbitrary real-time conversation without latency tells. But that window is closing fast.
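A minimal sketch of the challenge-response idea, assuming you already have live transcription of the call — the word list, latency budget, and verification logic here are all made-up illustrations, not any shipping product's protocol:

```python
import secrets
import time

WORDS = ["amber", "falcon", "granite", "willow", "copper", "harbor"]

def make_challenge(n=3):
    """Pick a random phrase the caller must repeat live."""
    return " ".join(secrets.choice(WORDS) for _ in range(n))

def verify(challenge, transcript, elapsed_s, max_latency_s=2.0):
    """Pass only if the exact phrase came back fast enough.
    A cloning pipeline has to synthesize the reply, which (today)
    adds latency that a live human speaker doesn't."""
    return transcript.strip().lower() == challenge and elapsed_s <= max_latency_s

challenge = make_challenge()
start = time.monotonic()
reply = challenge                 # stand-in for a live caller repeating it
elapsed = time.monotonic() - start
print(verify(challenge, reply, elapsed))  # True for an instant, correct reply
```

The randomness matters: a fixed phrase can be pre-generated, but a phrase drawn fresh per call forces the attacker into real-time synthesis, which is where the latency tells live.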