Baidu's Deep Voice can quickly synthesize realistic human speech


Baidu has been quietly working on projects other than self-driving cars at its AI center in Silicon Valley, and it has now revealed one of them to MIT Technology Review. Apparently, the Chinese tech titan has created a text-to-speech system called Deep Voice that's faster and more efficient than Google's WaveNet. The company says Deep Voice can be trained to speak in just a few hours with little to no human interaction. And since Baidu can control how it speaks to convey different emotions, it can quickly synthesize speech that sounds natural and realistic.

Google's WaveNet can also synthesize realistic human speech, but it's computationally demanding and hard to use in real-world applications at this point. Baidu says it solved WaveNet's problem by using deep-learning techniques to convert text to phonemes, the smallest units of speech. It then turns those phonemes into sounds using its speech synthesis network. The system converts the word "hello," for instance, into "(silence, HH), (HH, EH), (EH, L), (L, OW), (OW, silence)" before the speech network pronounces it.
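The pairing in the "hello" example can be illustrated with a short Python sketch. This is not Baidu's code: the function name `phoneme_pairs`, the `SILENCE` marker and the tiny `TOY_LEXICON` lookup table are assumptions made for illustration, and the lookup table merely stands in for the learned text-to-phoneme model described above.

```python
from typing import Dict, List, Tuple

# Hypothetical marker for the silence before and after a word.
SILENCE = "silence"

# Hypothetical stand-in for the learned text-to-phoneme model:
# a one-entry lookup table using the phonemes quoted in the article.
TOY_LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "EH", "L", "OW"],
}


def phoneme_pairs(phonemes: List[str]) -> List[Tuple[str, str]]:
    """Pad the sequence with silence and return each adjacent pair."""
    padded = [SILENCE, *phonemes, SILENCE]
    # zip stops at the shorter input, so this yields len(padded) - 1 pairs.
    return list(zip(padded, padded[1:]))


def word_to_pairs(word: str) -> List[Tuple[str, str]]:
    """Look the word up in the toy lexicon and pair its phonemes."""
    return phoneme_pairs(TOY_LEXICON[word.lower()])


if __name__ == "__main__":
    for pair in word_to_pairs("hello"):
        print(pair)
```

Running the script prints one pair per line, starting and ending with the silence marker, which mirrors the sequence the article gives for "hello."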

Both steps rely on deep learning and don't need human input. However, the system doesn't control which phonemes or syllables are stressed or how long they're pronounced. That's where Baidu steps in: it adjusts the stress and duration to change the emotion the speech conveys.

While the company says Deep Voice has solved WaveNet's problem, it still requires a ton of computing power. To mimic human-like interaction, a computer has to generate the words it speaks within 20 microseconds. Baidu's researchers explain:

"To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units."

Still, the researchers believe real-time speech synthesis is possible. They've already generated samples quickly and collected feedback on them through Amazon's Mechanical Turk. They asked a large number of people through the service to rate the quality of the samples, and the results indicate that they're of excellent quality.
