Microsoft announces speech synthesis AI 'NaturalSpeech 2' that can reproduce speech and singing voices from just a few seconds of sample audio



On April 18, 2023, a research team led by Kai Shen of Microsoft Research Asia and Microsoft Azure announced 'NaturalSpeech 2', a high-quality text-to-speech (TTS) system based on a diffusion model. NaturalSpeech 2 can faithfully reproduce not only spoken voices but also singing voices from just a few seconds of sample audio.

[2304.09116] NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
https://doi.org/10.48550/arXiv.2304.09116



NaturalSpeech 2

https://speechresearch.github.io/naturalspeech2/



Conventional TTS systems achieve high speech quality on single-speaker recording datasets, but such datasets cannot capture the diversity of human identities and styles, such as accents. Moreover, when scaling to large multi-speaker datasets, current TTS systems typically quantize speech into discrete tokens and use language models to generate those tokens one by one, which leads to instability: unstable prosody, skipped or repeated words, and poor voice quality.

With 'NaturalSpeech 2', however, Shen et al.'s research team used a latent diffusion model to build a highly expressive text-to-speech model that faithfully reproduces the voice in the sample.

NaturalSpeech 2 first converts the speech waveform into latent vectors with a neural audio codec that uses a residual vector quantizer, and then uses a diffusion model to generate those latent vectors conditioned on the text input. To enhance zero-shot capability, NaturalSpeech 2 also includes a speech prompting mechanism that facilitates in-context learning in the pitch predictor and the diffusion model. The research team says NaturalSpeech 2 surpasses previous TTS systems in terms of prosody and voice quality.
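The residual vector quantization used in the audio codec can be illustrated with a minimal sketch: each stage picks the codeword nearest to the residual left by the previous stage, so a stack of small codebooks approximates a continuous vector progressively. This is a toy NumPy illustration of the general technique only, not Microsoft's actual codec; the function name, vector dimension, and codebook sizes are all made up for the example.

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Quantize x with a stack of codebooks (residual vector quantization).

    Each stage selects the codeword nearest to the residual left over by
    the previous stage, so later stages refine the approximation."""
    residual = np.asarray(x, dtype=float)
    quantized = np.zeros_like(residual)
    codes = []
    for cb in codebooks:
        # nearest codeword to the current residual (Euclidean distance)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized = quantized + cb[idx]
        residual = residual - cb[idx]
    return codes, quantized

# Toy demo: 4 stages of 16 codewords each for an 8-dimensional latent vector;
# later codebooks use smaller scales, mimicking finer residual refinement.
rng = np.random.default_rng(0)
latent = rng.normal(size=8)
codebooks = [rng.normal(scale=0.5 ** i, size=(16, 8)) for i in range(4)]
codes, approx = residual_vector_quantize(latent, codebooks)
```

The discrete `codes` are what an autoregressive TTS system would generate token by token; NaturalSpeech 2 instead generates the continuous latent vectors directly with a diffusion model, sidestepping that token-by-token instability.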



The paper published by Shen et al.'s research team presents actual examples of speech synthesis with NaturalSpeech 2. The following sample, trained on LibriSpeech, synthesizes the sentence 'Indeed, there were only one or two strangers who could be admitted among the sisters without producing the same result.' First, a few seconds of unrelated audio are input as a prompt.


The 'Ground Truth' audio is the same text read aloud by the same person as the sample voice; this serves as the target 'correct' voice.


'Baseline' below is synthesized speech from a conventional AI model. It sounds somewhat flat and mechanical, with little intonation, but the output is still close to a human voice.


The audio output by NaturalSpeech 2 is below. Its breathing and accent are on a level comparable to the 'human voice'.


Below is an example trained on 'VCTK', a dataset from a research team at the University of Edinburgh. To generate the speech 'We will turn the corner.', unrelated audio is again input first.


Below is the audio for 'Ground Truth'.


The audio output in 'Baseline' is as follows.


The following is 'We will turn the corner.' as output by NaturalSpeech 2.


The research team also compares NaturalSpeech 2 with 'VALL-E', a speech synthesis AI model developed by Microsoft. For the sentence 'Thus did this humane and right-minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.', below is the audio output by VALL-E.


Below is the audio output using NaturalSpeech 2.


NaturalSpeech 2 can also output singing voices. Below is an example of singing the lyrics 'So listen very carefully.' First, unrelated audio is input as a prompt.


It is then output as a singing voice with intonation and rhythm.


NaturalSpeech 2 can also take a singing voice as input. In the example below, the same song, 'BINGO', is sung.


The example below shows that when singing is given as input, singing is likewise produced as output.


Microsoft's research team warns that 'NaturalSpeech 2 is capable of such faithful reproduction that there is a risk of abuse through imitation or impersonation of a speaker.' To avoid these ethical and potential problems, the team appeals to developers 'not to abuse this technology and to develop countermeasure tools that detect AI-synthesized speech,' adding, 'When developing such AI models, we always adhere to Microsoft's responsible AI principles.'

An unofficial PyTorch implementation of NaturalSpeech 2 is published on GitHub.

GitHub - lucidrains/naturalspeech2-pytorch: Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch
https://github.com/lucidrains/naturalspeech2-pytorch

in Software, Posted by log1r_ut