Voice Double: The Development Of An AI Voice

Artificial voices have come a long way, from the super-robotic to the almost human. Google Cloud Text-To-Speech has 180 voices across 30+ languages and variants, Amazon Polly has “dozens of lifelike voices across a variety of languages,” and IBM Watson enables your system to speak like a human in multiple languages and dialects. Chinese companies Baidu and iFlytek are even further ahead in voice development: Forbes reported in May 2019 that Baidu’s Deep Voice now needs only 3.7 seconds of audio to clone a voice.

The implications for authors and the publishing industry include AI-narrated audiobooks, which will lower costs, expand content production and increase rights licensing for audio. In this article, I'll share my Voice Double created by Descript.

You may have seen some of the deepfake videos posted online by Instagram user @bill_posters_uk. They show Donald Trump, Mark Zuckerberg, Kim Kardashian and Boris Johnson, among others, saying things they would never say in real life. It’s not just the video that is faked; the sound of their voices is faked, too.

In December 2019, Amazon announced the release of the Samuel L. Jackson Alexa Skill. Just say, “Alexa, introduce me to Samuel L. Jackson,” choose explicit or clean language options and then ask him some questions.

The responses are not necessarily a recording. As technology blog The Verge reported, “Instead of relying entirely on prerecorded phrases, the Samuel L. Jackson voice is powered in part by Amazon’s neural text-to-speech model. It’s like a lightweight deepfake, but the actor obviously gave his permission to stand in for Alexa’s standard voice.”

It’s not just celebrities who can create synthesized voices — podcasters and audiobook narrators can do it, too. In fact, anyone with enough data to train a voice algorithm can work with one of the voice synthesis companies to create a voice.

Descript.com allows creators to edit audio by editing the text transcript, and it can automatically remove filler words from the recording by identifying words like ‘um’ in the transcription. It also has an Overdub feature, powered by Lyrebird AI, a voice synthesis company that Descript acquired in 2019, which allows a podcaster to correct audio by typing.
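To give a rough idea of how transcript-based filler removal can work, here is a minimal Python sketch. It is not Descript's actual method, which I haven't seen. It assumes a transcript where every word carries start and end times in seconds; the `Word` class, the `FILLER_WORDS` list and the `keep_segments` function are all hypothetical names made up for this illustration.

```python
from dataclasses import dataclass

# Hypothetical filler list, for illustration only.
FILLER_WORDS = {"um", "uh", "er", "erm"}


@dataclass
class Word:
    text: str     # the word as transcribed
    start: float  # start time in seconds
    end: float    # end time in seconds


def keep_segments(words, fillers=FILLER_WORDS):
    """Return (start, end) time ranges of audio to keep once fillers are cut.

    Consecutive kept words are merged into one range, so every gap
    between ranges corresponds to a removed filler word.
    """
    segments = []
    previous_kept = False
    for word in words:
        # Normalize case and trailing punctuation before comparing.
        if word.text.lower().strip(".,!?") in fillers:
            previous_kept = False
            continue
        if previous_kept:
            # Extend the current range to cover this word.
            segments[-1] = (segments[-1][0], word.end)
        else:
            segments.append((word.start, word.end))
        previous_kept = True
    return segments


transcript = [
    Word("So", 0.0, 0.3),
    Word("um,", 0.3, 0.8),
    Word("welcome", 0.8, 1.3),
    Word("back", 1.3, 1.6),
]
print(keep_segments(transcript))  # [(0.0, 0.3), (0.8, 1.6)]
```

An audio editor could then stitch the kept ranges together, which is why deleting a word in the text can delete the matching sound.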

Since I have many hours of voice recordings, Descript made me my first Voice Double in late 2019. Over time, as I record more and add more data to the algorithm, the quality will improve and, hopefully, I will be able to license my voice to narrate other people’s audiobooks or play a part in a podcast drama.

Voice Double trained on non-fiction audiobook narration [VD128]. Nov 2019
https://api.soundcloud.com/tracks/739197709

This example sounds pretty good, but the intonation is wrong and it's a bit flat. Here are some responses from my Patreon supporters, who heard it first:

“That was amazing. I expected a slightly creepy, non-human sound with an attempted accent. It was surprisingly good. Can't wait to look back on this in a year.”

“It won't be long before your avatar is interviewing your guest's avatar and automatically posting it as a podcast! Wait, is that good or bad?”

“First impressions, straight off, definitely your voice. The quality and timbre, It sounds like you. What it needs to learn still is the inflections that you use, which may (probably) be typical to the region you grew up in, and then, of course, the UK but then also your own personality which expresses itself also through speech.”

“Oh wow, gosh! if you hadn't said your twin was speaking I am not sure I would have fully noticed this is AI, I would have thought you were a bit tired, as it sounds a bit labored. I will definitely be watching how this goes. All so exciting and scary.”

“Oh. My God. That is scary. Soon, the machines won't need us anymore – you've stepped into The Matrix!”

“That’s truly amazing! I wonder if it will be able to truly capture you over time to where listeners won’t know the difference.”

“Wow. It *IS* you. Just robotic-sounding, with pauses too long between words and an intonation level that's too similar, so a bit monotonous. But yes, the future is upon us, whether we like it or not.”

Voice Double trained on podcast intros and solo podcast shows [VoDo195]. Nov 2019

To try for more natural intonation, we loaded my podcast intros and solo podcast shows, which are more extemporaneous than audiobook narration. This example has more intonation and breath sounds, so it's more natural, but the speech is less coherent.

A conversation between 2 Voice Doubles. August 2020

In August 2020, Descript released Overdub out of beta, and with 30 minutes of training, you can now have a viable Voice Double. Author and podcaster Mark Leslie Lefebvre is a friend of mine and we've talked a lot about AI in the past, so we each made a Voice Double and then had a conversation between them.

Click here to listen and read the notes.

Want to know more about the possibilities of voice?