Podcast: Download (Duration: 42:37 — 34.9MB)
Subscribe: Spotify | TuneIn | RSS | More
Is digital narration with AI voices good enough for non-fiction or fiction audiobooks? Can human narrators benefit through voice licensing? What are the options for sales and distribution?
Taylan Kamis from Deep Zen explains digital narration for audiobooks, and I share some samples from my digitally narrated books through Deep Zen.
Taylan Kamis is the CEO and co-founder of Deep Zen Limited.
You can listen above or on your favorite podcast app or read the notes and links below. Here are the highlights and the full transcript is below.
Show Notes
- What is Deep Zen?
- How good is the quality of AI narration for non-fiction — and for fiction, where emotional resonance is so important?
- How AI narration may benefit human narrators with voice licensing, and Deep Zen's Ethical Statement
- Where can you distribute and sell AI narrated audiobooks — and when will this expand for indie authors?
- Cost and revenue related to AI narration
- When will AI narration become mainstream?
You can find Taylan Kamis at deepzen.io and on Twitter @DeepZen4
I've also included samples of my fiction and non-fiction books that Deep Zen has digitally narrated. You can listen via the links below:
- Co-Writing a Book: Benefits of Co-Writing. Digitally narrated by Alice (British female)
- Sins of Violence, a short story. Digitally narrated by William (British male)
If you'd like to listen to the full audiobooks, you can purchase the digitally narrated versions for 30% off with discount coupon 2021 (until the end of Dec 2021) directly from me at Payhip.com/thecreativepenn. [Here's how to apply the coupon.] The audiobooks are delivered by Bookfunnel and you can listen on their free app.
Transcript of Interview with Taylan Kamis
Joanna: Taylan Kamis is the CEO and co-founder of Deep Zen Limited. Welcome to the show, Taylan.
Taylan: Thank you. Thanks for having me.
Joanna: It's so interesting to talk to you.
What is Deep Zen and why are you so passionate about AI for voice?
Taylan: Deep Zen is a synthetic voice company focusing on creating human-like speech and emotion using AI.
As background, I was always interested in human-machine interaction. The starting idea was to build a system that could read any text as a human would, and be indistinguishable from one.
Then we started looking into the technical side: what the obstacles were and whether the technology was there. In the last three or four years, deep learning and AI have come so far that we achieved what we were aiming for sooner than we expected.
I think it is about productivity, making it easier to create audiobooks. Traditional production takes six or seven hours of recording and editing for every hour of finished content. Historically, I think that has hampered the growth and availability of audio content, especially in non-English languages.
I would like to remove the barriers around audio content creation and make the content available to a wider audience with a better and wider selection.
So at a high level, I think it's about choice and availability.
When you introduce a new technology, it also opens up new ways of looking at different businesses. Publishing is one of our verticals, but we also work as a platform company supporting online education businesses and synthetic video companies.
Even in publishing, for example, some publishers are experimenting with creating an early audio version of books due for release next year or next season, and sending it to buyers as an advance copy. If you're a buyer who needs to get through 12 books a week, it is also about convenience.
Making that content available to the wider publishing community, in different use cases, is also about improving productivity and making life easier for people.
Joanna: I love that idea of the digital ARC. I've heard that a couple of times now and we all think about selling them later on, but you're right. There are these other use cases.
Let's talk about what some people call the ‘quality,' because you said that what you envisioned three or four years ago has now come to pass. And certainly, I'll be sharing my examples from Deep Zen on this show.
Where do you think we are? People seem to think that nonfiction is better than fiction. Is that about the technology or is that about expectations of listeners around actors and that kind of thing?
Where's the quality right now with AI narrated audio?
Taylan: Currently we get great results with nonfiction with little editing. We have human input in our processes, so the natural language processing and the speech system give you the first version.
Although we are a voice company, we also built a system that can analyze the context, look at the characters, and identify the genre. All that information is passed to the speech system.
Then we have human input for the editing, but more and more, especially for nonfiction titles, the output we get from the system is good enough that very few touch points are required. That helps authors, because it means we can provide it at a more reasonable cost, which makes the technology more accessible to them.
I think the key difference between fiction and nonfiction is dialogue: how the story is actually told. We've been working really hard on the natural language processing system to identify the characters, and we are moving to the next stage of development, where the AI can change the voice based on the story and the characters.
At some point, I think we will be able to have multiple voices talking to each other. It's still in development, but you are right: so far we are getting better results with nonfiction, and the fiction side needs a little more human input.
Joanna: Absolutely. And, of course, there are fears around this, as there are about a lot of AI things. One of the biggest is around these synthetic voices taking jobs away from narrators and voice talent. I know you have an ethical statement on Deep Zen, so I'd love you to talk a bit about that.
What are your thoughts on AI narration taking jobs from human narrators?
Taylan: I think human and AI systems can coexist. We work with the narrator community and we license voices. Once we bring a new voice artist onto our platform, we pay a royalty back to the artist for each piece of work completed using that artist's voice. We use pseudonyms, and it is an additional revenue stream for the narrator.
Think about the physical limitations of recording and how many titles a voice artist can complete in a year. If you extrapolate, and you introduce AI and make the voice available for different use cases, platforms, and countries, it becomes an addition. I see it as a contribution on top of the narrator's mainstream work, and they don't have to do anything extra for it.
We have some cases, for example on the language-training side, where overseas customers use U.K.-based narrators' voices to create content. Normally those customers wouldn't come into the U.K. market to hire voice artists, so that's an additional revenue stream that wouldn't otherwise reach that narrator.
And I think it's also about scale. Just to give you an idea, there are around 50 million ebooks and, I think, about 50 million print titles, but so far the audiobook numbers are around half a million, and about 90% of those are in English.
And if you think about it, Audible started this 20 or 25 years ago. There's no way you can make all this content, 50 million ebooks, available in audio format without some sort of AI solution involved.
And when you look at other, non-English markets like French, German, and Spanish, the numbers are really minuscule. In Germany, which has the highest per-capita spending on books, there are around 30,000 audiobook titles, and in French I think fewer than 15,000. So there's a huge disparity in availability between languages.
I think AI is going to make that content available on a larger scale. And in our model, what we want is for narrators to benefit from that, so we share the value created by using AI with the narrator community. That's been our approach from the beginning.
Joanna: Now I love that. And that's why I've chosen to make my first books with Deep Zen, because I like that ethical statement. I think that's really great. And I'm using one of those voices.
And in fact, I agree that we absolutely must get more content into audio, not just in English, but as you say, in all these languages where they don't have such a mature audiobook market.
But even in English, I've just done a short story, which I also read myself. I'm obviously a British female, and the voice I used from Deep Zen is a British male, the voice of William. What I like is the ability to make more versions of our audio in different voices that might either suit the story or simply let people listen in accents that reflect their own life experience.
So perhaps an American female might like to listen to that voice, whereas I like to listen to a British female voice. I see opportunities for multiple voices within a language, or even multiple accents, given how many accents there are in each language. I see this as almost bigger than just language.
Taylan: Yes. The other example I would give is Edward Herrmann. He was a very well-known and well-liked American artist and audiobook narrator who passed away in 2014. We got in touch with his estate and family and licensed his voice.
We used his old audiobook recordings, and now his legacy goes on: we are able to produce audiobooks with his voice, and he's one of the favorite narrators in our system. So it is possible; AI makes all these different use cases possible.
One of the other things we are thinking about is introducing a narrator program: working with narrators, creating digital replicas of their voices, and getting them on board with our system. Then we are thinking about opening the software to those narrators so they can produce books using the synthetic versions of their own voices.
It will enable them to produce a higher number of titles. For example, they can keep doing human narration, or they can use the automated AI version, edit the book, and produce it in their own voice. That's one of the things we are currently exploring as well.
Joanna: That is absolutely what I want to do. I'll be a customer for you as soon as I can be!
Taylan: You should do it. Yes.
Joanna: Oh, yeah, absolutely, because now I've experienced the quality of yours. For my own audiobooks, even if I just get the AI model to read it and it's 80% or 90% complete and I'm happy, then there are just little edits to be done. That makes things a lot easier.
I think that's going to be brilliant and that will put the power into the hands of the narrators because then they can be the ones to say, ‘Okay, I can do more work this way,' as you said.
That brings up the question of what digital narration actually is, though. At the moment, a lot of the big audiobook platforms don't allow digital narration, but if it's actually the licensed voice of a real human that's mastered by an AI, that seems to blur the line. I wonder what the definition will be at that point.
Taylan: It is an interesting question, actually. I think it's a timing question: it's not an ‘if' but a ‘when' question for adoption, and we can refer to some of the research.
The Audio Publishers Association did a consumer survey, and I think it showed broad acceptance of AI narrators: 81% said they would still be interested in listening to an audiobook narrated using a high-quality artificial intelligence or AI voice, and 58% said that discovering that a book they enjoy has an AI narrator would have no impact on their opinion.
It might be early days, but I think it's going to become the norm. As acceptance increases, I think there's going to be an amalgamation of the human voice and the AI voice.
There are some platforms that are helping podcast producers edit pick-up sessions or changes using a synthetic copy of their own voice. So that's human narration plus an AI edit.
Our technology can enable the other way around: we are advanced enough to give you the ability to create the whole piece of content in a synthetic version of your voice, one that is very similar to your own.
Then you have the control; if you are doing the editing, it wouldn't be any different from editing your own articles. I think it's going to be a convergence of the two in the near future.
Joanna: In terms of right now, or at least the next year or two, how are you distributing and selling audiobooks? Obviously, for independent authors it's not possible everywhere at the moment, but I think Deep Zen has some distribution already.
What platforms are allowing AI narrated audio?
Taylan: From day one, we've been really careful about quality and about how we communicate with publishers, authors, and retailers. We didn't want to become something with subpar quality, and that enabled us to secure distribution.
If you produce your content with Deep Zen, we can distribute it to 50 different retailers, streaming services, and libraries through the distribution partnership agreements we have signed.
We are thinking about making that available to independent authors. If you are working with us and getting your digitally narrated audiobook produced with us, we will offer that distribution service through our own agreements and make your content available on the different platforms.
For now, it will be restricted to our own distribution channels, but with wider acceptance, the other platforms will probably start accepting digitally narrated audiobooks directly at a later stage. The exception so far is Audible, and I think they're also looking into it.
They are a big part of the ecosystem, obviously, and they're also starting to look into how they can introduce AI-generated content onto their platform in the near future.
Joanna: Obviously, they're owned by Amazon, and Amazon have Amazon Polly, which is another AI voice. I imagine that they might want to develop their own ecosystem.
Taylan: I can't go into too much detail. There is some work in progress.
Joanna: I'm sure.
Taylan: I think the good thing about having Deep Zen available to authors and publishers is that we are independent. The key thing is that if you go with one of these bigger platforms, it comes with conditions attached: you have to make it exclusive to that platform and you can't distribute it on another one.
We are platform agnostic and neutral, and that gives the choice back to authors and publishers in terms of where they want to list their content, how they distribute it, and how they maximize the value they get from their work.
We play a really critical role by staying independent and providing an equally good technical platform to the wider publisher and author community.
Joanna: I know you might not be able to answer exactly, but is there a roadmap in terms of timeline for opening up to authors? Because I feel like I've been talking about this for a few years and everyone laughs at me and says, ‘Oh, it's years and years away.'
How long do you think it will be before this is possible for independent authors?
Taylan: Oh, for us, basically using Deep Zen's technology you mean?
Joanna: Yes.
Taylan: It's very soon. We have already started to make it available. We built a publishing portal, so you can go to portal.deepzen.io, sign up, give your business details, and upload your manuscript. That service is now available.
Joanna: Yes, that's the service I've just used to do two audiobooks. When would we be able to distribute through Deep Zen to the 50 retailers, for example?
Taylan: We are aiming to do it before the end of the year.
Joanna: Oh, great. Wow. Okay. That's amazing.
Taylan: Yes. There's some infrastructure that needs to be put in place in terms of how we are going to manage it, so there's some development work currently being done, but we would like to do it as soon as possible.
Joanna: Well, that would be great. In that case, I want to ask about price, because there are a couple of things with price.
On the one hand, people think, ‘Oh, it should be really, really cheap to do AI voice,' when, of course, it's technology with a lot of value behind it. So even though it might be cheaper than hiring a famous narrator, it still has costs. That's the cost side: will that come down?
And on the other hand, the pricing, is there an expectation that the pricing of an AI narrated audiobook should be cheaper because it's not a human?
What's happening with both the costs and the revenue when it comes to AI?
Taylan: In terms of the cost structure on our side, for the human-input part we have two tiers of service. The first is if you want the full QC process: if you would like to give input and have us make changes and edits to the book, that requires a rigorous process with the editors and increases the cost on our side.
Currently, it's around $100 to $130 per finished hour, and the majority of that cost is the human editing involved. But we are now confident enough that we can eliminate some of that process. Especially for nonfiction, the content is good enough to be distributed with minimal human touch.
That will bring the price point down to around $50 per finished hour; the $40 to $50 range is what we are looking at. That service wouldn't have the same QC process. We will still handle the lexicons, so the pronunciation is going to be perfect, but small changes, like adding a pause after certain words or other detailed editing, won't be available; you need to rely on the machine's accuracy.
We are getting very confident that with minimal input we can get a really high-quality output, and we will pass those savings on to the customers.
In terms of pricing, should AI-narrated content be cheaper? Not necessarily. I think the intellectual value of the content and how it is produced are somewhat separate things. Ultimately, in terms of availability and choice, I think price was one of the barriers that stopped more content being available.
My expectation is that as we pass along these savings, it will become feasible to break even and start making money on a smaller number of copies sold, which will help authors and publishers price more reasonably. So they will probably be more competitive.
That's how I'm thinking. Long term, it will help price points come down while keeping the same returns for publishers.
Joanna: I think there's going to be a lot of different options within the next few years, and that will change, but I agree with you. I think the content is one thing.
The other thing I'm interested in is that I used to think we were simply trying to replicate the human voice, and obviously we are trying to replicate emotion and that kind of thing.
I think it's important that we label things as digitally narrated, and I'm even putting labels on the audiobook cover so the books can be easily recognized as digitally narrated, because I feel that's something to be embraced.
There are special things about AI narration, just as there are special things about human narration, and we don't want to try to fake it.
Do you think that we need to encourage this kind of different labelling in order to encourage trust from people? Or do you think it doesn't matter because people will just listen anyway?
Taylan: We advise, as best practice, labelling it in the metadata: stating that it's synthesized with the digital voice of the narrator, or with the pseudonym. That's how we see being open and frank with the customer and the community.
I agree with you. That's something we are recommending, and I think showing the distinction will probably be beneficial in the long term.
Joanna: Great. So we've talked a bit about what's happening right now. When do you see this being completely accepted by both the publishing industry and by listeners? Are we talking 5 years, 10 years?
When will AI narration be mainstream?
Taylan: A couple of years; I don't think it will take five years. It will probably come with different use cases, probably more on the fiction side.
We've been working on this for three or four years now and we see the reports: the content is consumed in libraries and on different platforms, and people are paying for it. So I think it's just the mindset, and the bigger platforms and the publishers are slowly adopting it.
Maybe the change should come from the authors and the narrators, as more people use it and take advantage of it. We would like to be the platform enabling that, rather than the big parties controlling it and doing it all in one go.
The way I see it, over the next couple of years it will definitely start to emerge in non-English markets, because we get quite a lot of inbound requests from German and French publishers. All these audio services are bringing in more subscribers and users, they want to give them a good experience, and people want content in their native languages.
What we are hearing is that the studios in Germany are full all the time. Even if you want to pay premium prices and you can get the narrators, there's not enough studio capacity to get your book done on time.
I think there will probably be earlier adoption in non-English markets. That's what I'm expecting, and that will probably drive the change in the U.S. and the U.K.
Joanna: I totally agree with you. I think the non-English language services will want this more and then it will drive everything forward and then the rest of us will be like, ‘Yeah, we want to get involved.'
Taylan: Our thinking, for the publishers, is that you can get your content into audio now at a more reasonable price point. If you are thinking about rights, we are talking about five to seven years.
There are some platforms where you can already sell your content. And on the Audible question, once they do start selling digitally narrated audiobooks, you will have an early start: your content will already be there, rather than you trying to get everything produced at that stage.
So I think early adoption is key in this case, and it is certainly not going to take four or five years. It's going to be next year and the year after. I think we will see a big shift on a massive scale.
Joanna: Excellent. I've been excited about this for years, so I'm glad it's finally happening.
Where can people find Deep Zen online? Where can everyone find you?
Taylan: Sure. Our web address is deepzen.io. For the authors, they can sign up to our portal, it's portal.deepzen.io. And if you have any queries, then the email is hello@deepzen.io.
Joanna: Fantastic. Thanks so much for your time, Taylan. That was great.
Taylan: Thank you. Thanks for having me. Have a good day. Thank you. Bye-bye.
Bobbie Falin says
The pause between sentences seems just a bit too long. It leaves me feeling like my own breathing is off. I wonder if it could be cut by a tiny bit.
I subscribed to this program and the extra level several months ago, but have not played with it much. Maybe I should devote more time to it.
Bobbie Falin says
Correction: I bought Speechelo and a package with extra voices, but they are the same voices Deep Zen uses. How is this possible? Are they the same company?
Joanna Penn says
I don’t know, you’ll have to ask them.
lisa says
Honestly, I think the fiction voice sounds way better than the non-fiction one. Really exciting to hear!
Joanna Penn says
Interesting! It also has a lot to do with voice preference and what we notice in an accent.
John Ravi says
Hi Joanna,
Great interview! I think AI voice is the future and it will definitely affect all of us going forward. With the latest technology, content will be available to a wider audience, and this is what any content creator wants. I am very interested in AI voice technology, and I hope it gets better with time. I am on the side of technology and hope to see more of it in the future. It was a great interview; I really enjoyed it.
Joanna Penn says
Glad you enjoyed it!
Matty Dalrymple says
I love how fast listenable AI narration is coming! One consideration for me is that I plan to update my non-fiction books chapter-by-chapter ongoing and would want to be able to generate and upload new audio chapter-by-chapter. If I’m understanding DeepZen’s pricing correctly, their minimum charge per project would make that not financially viable. I’m hoping that over time (probably as their process becomes more strictly AI-generated and reduces any human intervention needed), there will be a straight per word / minute charge with no minimum.
Matty Dalrymple says
Just got to the part of the interview where Taylan mentions this! : )
Karen Commins says
Hi, Joanna! As a professional audiobook narrator, I’d like to offer some important points that authors should consider before embracing a synthesized voice to record your books.
One’s voice conveys the essence of being HUMAN. Nothing expresses our thoughts, feelings, and emotions better than the human voice.
People buy audiobooks because they want to be entertained, informed, and inspired.
An audiobook is a performance art based on the narrator’s interpretation of the author’s words. We do MUCH MORE than read!
Before I ever walk into the booth to record an audiobook, I've carefully prepared for the moment:
I read the entire book:
— In a fiction book, I note all of the characters’ quirks and descriptions so that I can develop a convincing voice for each character based on the author’s clues and present the characters as real people in real circumstances, not some cartoon.
— In non-fiction books, I research the author and the content of the book so that I understand the message to be conveyed.
In either case, I’ve done copious research on correct pronunciations. Anyone who has ever heard a GPS mispronounce the name of their town will be annoyed to have a computer voice mispronounce things in an audiobook. Mispronunciations take the listener out of the story.
Technology is ideal when robots replace humans in soul-sucking jobs like installing computer chips on a circuit board. It will never replace a human’s ability to convey emotion.
Artificial intelligence also can’t detect the SUBTEXT in a single sentence, much less over the trajectory of an entire book.
Words on a page can fall flat and be interpreted in different ways, whereas a narrator can say the same sentence in a number of ways to impart different meanings using volume, pitch, tone, and pauses. The listener can actually HEAR THE DIFFERENCE when I smile!
Authors carefully choose every word they write. Audiobook narrators work to understand and make organic acting choices that convey the author’s intent with every word.
When an author considers everything she’d lose by choosing an AI voice over a human voice simply to save some money in production costs, I’d hope she’d realize that the true value in an audiobook is in the human narrator’s ability to TELL THE STORY and take the listener on the journey with us.
Cordially,
Karen Commins
http://www.KarenCommins.com
Petrea Burchard says
As both an author and an audiobook narrator, I find it incredibly disappointing to see an influential author embrace AI voice, which essentially attempts to replace the artist in art. Not that it can. But some people will settle for the lack of soul, and that's disturbing.
Authors, starting now: when you sign a contract with a publisher, if you don’t want your audiobook narrated by a machine you need to get that in your contract.
If AI can replace the human narrator, it can surely replace the human author.
Those of us who appreciate the artist as well as the art will perhaps require a label on each book: “Written by an actual human!” “Narrated by an actual human!” —because I don’t want to listen to a machine reading words written by a machine, telling me stories about humans and the human condition.
Joanna Penn says
Hi Petrea, as I replied on Twitter — There’s room for all.
90% of books don’t have audio versions in English, let alone other languages and dialects in all those languages. Like ebooks and print, one does not replace the other. Plus, narrators can license voices – as I will do.
I am also a narrator and I hire human narrators.
I have also labelled my AI-narrated books clearly so listeners are not fooled in any way.
Bernard says
Petrea, by the way you and others are writing about AI, it seems obvious that none of you have ever even tried to use the AI technology you are attacking. This is a shame, and I'm not totally convinced you're qualified to comment.
I am a veteran radio broadcaster, voice-over artist, producer and published magazine journalist and author. I have voiced pretty much everything you can think of. I have hired voice talent.
I absolutely love AI text-to-speech and AI writers. Yes, they do actually exist and they write excellent non-fiction articles and blogs and they are even starting to write stories, songs, poetry. AI tech is improving all the time. Yes, one year ago most of the AI voices were gargly and yodelling. There was a period of great disappointment. I cried every day for six months straight. The good news is that they keep training the AIs with more and new data and they keep improving. The DeepZen voices mentioned above are some of the best available at the time of writing. There are other platforms which have really good, near-human sounding voices. As for the AI writer platforms, just a few months ago some of them had difficulty staying on point. More data training fixed it. They can now write long-form too. Plus, you can choose the mood or formality of the writing ‘voice’ so to speak. The improvements and upgrades are arriving weekly!
I don’t fear being replaced at all. I have more than one skill. If you love the business of book narrations start to think of ways of supplying the AI narrations. Clone your own voice so you don’t have to spend a week or two voicing yourself hoarse, LOL. Think of the time you’ll save and you would be able to take on more projects. A good voice clone will cost around $500 and you’ll need to supply around 20 hours of data which, you should already have from all those books you’ve voiced, right? An affordable investment for tripling your income from book narrations. You can thank me later 🙂
A hundred years ago, did radio kill off the theatre? Did TV replace the radio? Has the Kindle sent physical books into oblivion? No. The internet has changed everything, and it's an ongoing process. All the forms of media and entertainment will continue to co-exist, but in an online form. They are already working on AI actors. Actually, AI video presenters and hosts already exist in AI text-to-video tech, and it's quite good. They are currently overcoming a lip-sync problem, but soon they will be human-like.
Joanna Penn says
Thanks, Bernard, and it’s good to hear that you are also enthusiastic about AI for voice and also for writing and other tasks.
Interesting times, indeed.