NVIDIA is developing an AI that can speak expressively like a real human being



At INTERSPEECH 2021, a technical conference on speech processing, NVIDIA, a semiconductor maker and developer of artificial intelligence (AI) technology, announced that it is developing an AI that can speak naturally enough to be mistaken for a human voice.

NVIDIA Shares Speech Synthesis Research at Interspeech | NVIDIA Blog
https://blogs.nvidia.com/blog/2021/08/31/conversational-ai-research-speech-synthesis-interspeech/


Synthetic speech has evolved from the mechanical voices of automated guidance services and old car navigation systems into something far more human and sophisticated in the virtual assistants built into smartphones and smart speakers.

Even so, there is still a large gap between real human speech and synthetic speech, and it is usually easy to tell which is which. According to NVIDIA, it is difficult for AI to fully reproduce the complex rhythms and intonations of the human voice.

NVIDIA's videos introducing new products and technologies have traditionally been narrated by humans. This is because conventional speech synthesis models offer only limited control over the tempo and pitch of the synthesized speech, so they cannot stir a viewer's emotions the way a human narrator can.



However, NVIDIA's speech synthesis has made great progress since its speech synthesis research team developed the text-to-speech technology RAD-TTS (PDF file). NVIDIA NeMo is NVIDIA's open-source conversational AI toolkit for researchers working on automatic speech recognition, natural language processing, and text-to-speech synthesis. By treating the human voice as an instrument, it allows the pitch, duration, and energy of synthesized speech to be finely controlled at the frame level.
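As a rough illustration of what text-to-speech with NeMo looks like in practice, the sketch below loads the pretrained FastPitch spectrogram generator and HiFi-GAN vocoder checkpoints distributed with the toolkit and synthesizes a short line of narration. The checkpoint names and exact API may differ between NeMo releases, so treat this as a minimal sketch rather than the exact pipeline used in NVIDIA's videos.

# Minimal text-to-speech sketch with NVIDIA NeMo (assumes nemo_toolkit[tts] is installed;
# checkpoint names and API details may vary between NeMo releases).
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# FastPitch predicts a mel spectrogram from text; HiFi-GAN turns that
# spectrogram into an audio waveform.
spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
vocoder = HifiGanModel.from_pretrained("tts_hifigan")

text = "NVIDIA shares expressive speech synthesis research at Interspeech."
tokens = spec_generator.parse(text)                                # text -> token tensor
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)   # tokens -> mel spectrogram
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)     # spectrogram -> waveform

# Save the result as a 22.05 kHz WAV file (the sample rate of these checkpoints).
sf.write("narration.wav", audio.to("cpu").detach().numpy()[0], 22050)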

You can hear what speech synthesis with NVIDIA NeMo actually sounds like in the following video.

All the Feels: NVIDIA Shares Expressive Speech Synthesis Research at Interspeech - YouTube


In the video, NVIDIA-affiliated video creator David Weissman reads the script into a microphone.



Engineers then use AI models to convert the recorded voice.



Weissman's narration is converted into the voice of a female narrator. Machine voices usually have a peculiar intonation that sounds unnatural, but the voice converted to this female narrator sounds completely natural and smooth. The synthesized voice can also be adjusted to emphasize a specific word, or its speed can be changed so the narration matches the video. Speech synthesized or converted with NVIDIA NeMo has reportedly been used for the narration of videos NVIDIA has released recently.
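The speed adjustment described above can be sketched with the same hypothetical NeMo setup as before, using the pace argument that FastPitch's generate_spectrogram exposes for changing the speaking rate. The argument name and the direction of its scale are assumptions about the toolkit's API, not details given in the article.

# Hypothetical continuation of the earlier sketch: re-synthesize the same line
# at a different speaking rate so the narration lines up with a video cut.
# `pace` is FastPitch's speaking-rate control (assumed API; 1.0 is the default,
# and the NeMo docs describe which direction speeds speech up or slows it down).
retimed_spec = spec_generator.generate_spectrogram(tokens=tokens, pace=1.25)
retimed_audio = vocoder.convert_spectrogram_to_audio(spec=retimed_spec)
sf.write("narration_retimed.wav", retimed_audio.to("cpu").detach().numpy()[0], 22050)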



Speech synthesis is useful not only for narration but also for music production. When producing a song, for example, the chorus normally requires recording and layering the singing voices of multiple people, but with synthetic voices a chorus part can be produced without gathering multiple singers.



The AI models included in NVIDIA NeMo were trained on NVIDIA DGX systems with tens of thousands of hours of audio data, and they run on the Tensor Cores of NVIDIA GPUs. NVIDIA NeMo also provides models trained on Mozilla Common Voice, a dataset containing roughly 14,000 hours of speech in 76 languages. 'With the world's largest open voice dataset, we aim to democratize voice technology,' NVIDIA commented.

The source code of NVIDIA NeMo is available on GitHub.

GitHub - NVIDIA/NeMo: a toolkit for conversational AI
https://github.com/NVIDIA/NeMo/tree/main/docs
