With Tacotron 2, the acceptance of AI-based personal assistant systems should increase, because they now sound like real people. Google has taken a big step forward: its latest synthesized speech output is hardly recognizable as a robot voice.
Tacotron 2: Google's new speech synthesis AI sounds like a human
Google's new voice AI, Tacotron 2, uses a two-step text-to-speech process. In the first step, the system generates a spectrogram, a visual representation of an audio signal's frequencies over time. It encodes pitch and other parameters that determine correct pronunciation.
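For readers who want to see the idea in code, here is a toy sketch of what a spectrogram is (plain numpy for illustration, not Google's implementation): the signal is sliced into short frames, and the magnitude of each frame's Fourier transform becomes one column of the time-frequency picture.

```python
# Toy spectrogram sketch (illustrative numpy code, not Google's implementation):
# slice the signal into overlapping frames and take the magnitude of each
# frame's Fourier transform.
import numpy as np

def spectrogram(signal, frame_len=1024, hop=256):
    """Magnitude spectrogram: rows are time frames, columns frequency bins."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000                             # sample rate in Hz
t = np.arange(sr) / sr                 # one second of audio
tone = np.sin(2 * np.pi * 440 * t)     # a pure 440 Hz "pitch"
spec = spectrogram(tone)

# The loudest frequency bin sits near 440 Hz: this is the kind of pitch
# information the spectrogram hands to the next synthesis stage.
peak_hz = spec.mean(axis=0).argmax() * sr / 1024
```

In Tacotron 2 the spectrogram is predicted from text by a neural network rather than computed from existing audio, but the representation is the same kind of time-frequency picture.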
The spectrogram is then transformed into speech by DeepMind's neural network WaveNet. The software specializes in producing audio from such representations and has powered the voice of the Google Assistant since October 2017.
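The second step, turning a spectrogram back into a waveform, is what WaveNet learns to do. As a rough stand-in, the classic Griffin-Lim algorithm performs the same inversion iteratively; the sketch below (plain numpy, explicitly not DeepMind's WaveNet) recovers a waveform from magnitudes alone.

```python
# Illustrative vocoder sketch: reconstruct a waveform from a magnitude
# spectrogram with the Griffin-Lim algorithm. This is a classic stand-in
# for the learned inversion that WaveNet performs; it is NOT WaveNet.
import numpy as np

N, HOP = 1024, 256  # frame length and hop size in samples

def stft(x):
    w = np.hanning(N)
    frames = [x[i:i + N] * w for i in range(0, len(x) - N, HOP)]
    return np.fft.rfft(frames, axis=1)

def istft(spec, length):
    # Weighted overlap-add inverse of stft().
    w = np.hanning(N)
    x, norm = np.zeros(length), np.zeros(length)
    frames = np.fft.irfft(spec, axis=1)
    for k, i in enumerate(range(0, length - N, HOP)):
        x[i:i + N] += frames[k] * w
        norm[i:i + N] += w ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, length, iters=30):
    # Start from random phase and alternate between time and frequency
    # domains until the phase is consistent with the given magnitudes.
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(iters):
        x = istft(mag * phase, length)
        phase = np.exp(1j * np.angle(stft(x)))
    return istft(mag * phase, length)

sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 440 Hz test tone
mag = np.abs(stft(tone))            # the "spectrogram" input
y = griffin_lim(mag, length=sr)     # waveform recovered from magnitudes only
```

WaveNet replaces this generic iteration with a neural network trained on real speech, which is why its output sounds far more natural than classic signal-processing inversion.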
Tacotron 2 was trained on 24 hours of audio material from a professional speaker. The results sound so authentic that they are no longer distinguishable from real voice recordings.
Tacotron 2: Emphasis yes, emotions no
The new voice AI provides improved intonation for a more natural flow of speech, taking punctuation and the position of words in the sentence into account. For example, if a sentence ends with a question mark, the voice rises at the end.
But there are still deficits: according to the researchers, the AI voice is not yet able to express emotions and has problems with individual foreign words. The speech output also does not yet work in real time. For the kind of sensual AI experiences depicted in the sci-fi film "Her," the voice is therefore not quite ready.
Further sound examples are available on this website; the complete publication is available here.