Photo by Matt Botsford on Unsplash

I got a computer to ‘deepfake’ my voice and this is what happened

Matthew Kershaw
4 min read · Sep 10, 2020

For those who don’t know, Descript is a software tool which allows you to edit podcasts and other spoken word content much more easily than traditional audio editing applications.

Its main feature is that, using machine learning, it converts the audio into a written transcript. This allows the user to edit the audio by editing the text. Pretty nifty.

Descript’s interface

One new thing that you can get if you subscribe to a Descript Pro account is a feature called ‘Overdub’.

Thanks to Descript’s purchase of Canadian AI startup Lyrebird, the Overdub feature allows the system to ‘learn’ your voice.

It takes 30 minutes of standing there reading a script for the machine to learn how you talk.

But once you’ve done that, whenever you record something, if you need to add or change words, there’s no need to re-record. Descript will generate the new words for you.

It’s as simple as just typing those words into the script and — bingo — they appear in the audio as if you’d said them.

It’s like deepfakes but for voices.

Or at least that’s the promise.

This is technology very much in its infancy after all.

So I gave it a go, and here is my experience…

First of all, reading a script for 30 minutes was a lot harder than I imagined it would be. Hats off to those audiobook artists who have to read hours of material. Or indeed Sir David Attenborough, whose Blue Planet script is used by Descript as a training tool.

A face for radio: as you can see, I don’t have access to a professional studio.

This 30-minute barrier prevents the Descript system being used for nefarious purposes: you can only deepfake the voice of someone who’s prepared to read 30 minutes’ worth of content.

Secondly, it really does reproduce your voice in a scarily accurate way.

I reproduced, entirely from text, two speeches.

The first was Renton’s opening monologue from Trainspotting (X-rated, obviously).

The second, for contrast, was Churchill’s ‘Blood, Toil, Tears and Sweat’ speech, his first speech to the House of Commons after becoming Prime Minister.

I repeat: I DID NOT SPEAK THESE WORDS, THEY WERE COMPUTER GENERATED IN MY VOICE.

Both of these do have a computer-generated feel to them, but I was surprised how natural they sounded.

I did have to make some adjustments, mind.

One of my main criticisms of Descript overall is its current lack of support for UK voices, or indeed any non-US accents. So I had to convert things like the US-style pronunciation of ‘process’ as ‘prawcess’ to the British ‘proh-cess’.

It also pronounced the ‘tears’ in ‘blood, toil, tears’ as in ‘tears paper’ rather than as in ‘crying tears’.

(And those Trainspotting f-bombs required a workaround for Descript’s slightly prudish AI.)

But overall, I would say it was a success.

Where the system was weaker was in inserting words into, or correcting, existing recordings.

Here’s me reading the first paragraphs of this article but with several, er, notable errors.

And this is me, correcting the errors using Overdub.

Unfortunately, the inserts are painfully obvious. They are in a different tone and the background ambience is different.

I trained my voice in the same location and using the same microphone, just 24 hours apart. The mic was probably positioned slightly differently. And obviously when you record a small segment you give it a slightly different energy to a 30-minute documentary voice-over.

I should add that Overdub does allow you to train the same voice but with different emotions — so I could have an ‘angry’ Matthew voice or an ‘energetic’ voice. But this wouldn’t get over the problem of ambience.

The only way round that, at the moment, is to record in a professional environment.

It would be great if Descript’s Overdub feature could also emulate the background/ambience.

So, the system isn’t perfect. But this is nascent technology and one which will only get better over time.

The implications are huge. There are obvious uses for the tech in voice-over work.

In business, it could be used to ensure that all spoken-word communication is consistent, without needing to worry whether the original actor is available for a new read.

If I were an actor, I would be looking to somehow protect my voice or, to put it more positively, find a way to monetise it in this new space.

I, for one, am looking forward to mailing in future podcast interviews rather than having to actually do them myself…

If this was interesting, please leave a comment or get in touch.

And please clap or share this article if you enjoyed it, as it helps other people find it.

If you’d like to be on my (100% spam-free) mailing list to be updated when a new article is published, you can sign up here.

Matthew Kershaw

Consultant, advising AI-powered businesses and those who want to use the power of AI — particularly in the creative industries https://bit.ly/MatthewKershaw