The estimated $3.5 billion in annual sales of audio books is still a small fraction of the $135 billion generated globally from printed books. Audio books have mainly been the domain of publishers with the budget for costly and time-consuming recordings by professional actors and voiceover specialists. The result has been a narrow range of titles and a high consumer price tag.
DeepZen, a UK-based startup, is looking to bring audio books to the masses by using artificial intelligence to synthesize the human voice in order to replicate emotions and intonations, helping slash the cost and increase the speed of production. For example, the company’s website features clips from Franz Kafka’s The Metamorphosis, presented in a female voice.
The startup also partners with media outlets, gaming companies, and advertising agencies. Other potential use cases include newscasting, marketing, documentation, training, and assisting people with learning, reading, or sight disabilities.
What DeepZen needed was a flexible, scalable high-performance computing (HPC) platform on which it could use neural networks and natural language processing to train its “cloned” voices to articulate emotions and expressions.
“We are highly dependent on high-performance computing because we are a machine-learning company. There’s lots of video, animations, and advertisements that need voiceovers. We are able to create voice very quickly. And we are doing it on Oracle Cloud and its GPU service.”
Kerem Sozugecer, CTO and Cofounder, DeepZen
DeepZen joined Oracle for Startups in 2019, benefitting from discounts and credits to establish its platform on Oracle Cloud High-Performance Computing (HPC), which runs on Oracle Cloud Infrastructure (OCI), at a time when cash was tight but resource requirements were high.
“Oracle’s startup program was very important to us and a big gesture,” says Kerem Sozugecer, DeepZen CTO and cofounder. “The support, advice, engagement, and contacts derived from being part of the program really made a difference on our journey to scale.”
The startup also found that Oracle Cloud Infrastructure met its data center requirements, letting it switch seamlessly between servers in different locations with minimal delay. OCI’s auto-scaling feature lets the company adjust the number of compute instances to match demand. “It’s very fast to get started, and you can launch as many servers as you want within minutes,” Sozugecer says. “If you want to stop instances when not needed, you just scale down and that way costs are kept under control.”
DeepZen was one of the first companies to try out Nvidia’s A100 Tensor Core GPUs, made available on bare metal instances on Oracle Cloud Infrastructure. In early tests in the Oracle data center, the GPUs sped up training of the company’s voice models by 36%.
“Our auto-regressive neural-network acoustic model now takes only five days to complete versus seven previously,” Sozugecer says. “If you factor 36% acceleration into all our training, we are saving a whole month every three months. For a startup like us, that’s really significant.”
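The figures in that quote can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the 36% figure is read as a reduction in training time applied across all workloads, which the source does not spell out. Under that reading, 36% of a three-month window comes to roughly one month. The seven-to-five-day acoustic model on its own works out to a somewhat smaller reduction, so the 36% figure presumably reflects a broader mix of models.

```python
def fraction_saved(before_days: float, after_days: float) -> float:
    """Share of the original training time that is no longer needed."""
    return (before_days - after_days) / before_days


def speedup(before_days: float, after_days: float) -> float:
    """How many times faster the new run is than the old one."""
    return before_days / after_days


# Figures quoted for the acoustic model: seven days before, five days after.
acoustic_saved = fraction_saved(7, 5)    # 2/7, a bit under 0.29
acoustic_speedup = speedup(7, 5)         # 7/5 = 1.4

# Assumption: the 36% figure is a time reduction applied to all training.
QUOTED_REDUCTION = 0.36
months_saved_per_quarter = QUOTED_REDUCTION * 3   # roughly 1.08 months

print(f"Acoustic model: {acoustic_saved:.0%} less time, {acoustic_speedup:.1f}x faster")
print(f"Across a quarter: about {months_saved_per_quarter:.2f} months saved")
```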
DeepZen has since switched all of its servers to A100 GPUs and continues to scale with growth. “We can’t survive without them,” Sozugecer says.
Running its high-performance computing on OCI not only has reduced voice-training times, but it also has accelerated the recording of audio books. A 10-hour audio book that used to take 65 hours to narrate can now be created in only one hour. As a result, DeepZen is making it economical for midsize publishers that can’t afford to hire actors and build or rent recording studios to break into this market segment.
CEO and Cofounder Taylan Kamis estimates that companies using DeepZen’s technology will be 75% more efficient in terms of the time and cost of producing audio books. The effect for consumers will be commensurate, opening an affordable alternative to the printed page. “As we gain efficiencies and the marginal cost of audio plummets, we expect a natural conversion of audio to complement other formats,” Kamis says. “Listening is one of the core things people do, and our technology is removing barriers in many markets.”
DeepZen offers a software-as-a-service model through API integration with its platform, as well as end-to-end management of audio production and customized services. A radio news organization in North America is using the DeepZen platform to clone six of its reporters’ voices, freeing them from having to record their scripts. The station can pick and choose voices, including regional accents, using an API.
Kamis notes that the company isn’t disenfranchising the actors, narrators, and voiceover artists who allow their voices to be cloned by its AI techniques. “We made a strategic decision to license their voices and remunerate them for every project where their voice is used,” he says.