NVIDIA takes on the challenge of 'suppressing noise during calls using deep learning'
by bruce mars
Noise cancellation is indispensable for a comfortable conversation over a phone or a calling app, but it does not always shut out noise completely. NVIDIA, one of the leading semiconductor makers in the United States, is working on several fronts "to build a practical noise canceling function using deep learning".
AI Powered Speech Enhancement | 2hz
https://2hz.ai/?utm_source=Nvidia%20blog
Real-Time Noise Suppression Using Deep Learning | NVIDIA Developer Blog
Our daily lives are full of noise. When you make a phone call on the street or at an airport, you hear a lot of noise around you, and the noise around the other party comes through as well. NVIDIA decided to use machine learning to reduce noise during such calls, and says it has gone through two years of trial and error.
Here is a video demonstrating 2 Hz, the noise canceling system built with deep learning.
2 Hz demo
First, speech is recorded in an environment with loud background sound.
The recording contains noise, and the speech is hard to make out.
Here, when noise cancellation is applied by 2 Hz ...
The voice could be heard very clearly, and the background noise could not be heard at all.
Next, when a recording made with a police car siren sounding is played back without going through the 2 Hz system ...
The siren is very loud, and you have to listen carefully to make out the words.
However, when noise cancellation is applied using the 2 Hz system ...
The result is surprisingly clear: only the human voice can be heard, and you can no longer even tell that a siren was sounding in the background.
NVIDIA has achieved this high level of noise cancellation, but building a system to suppress noise was a very difficult road.
First of all, the noise canceling function built by NVIDIA suppresses the noise behind each of the two people on a call before the audio is transmitted to the other side, as shown in the figure below. This contrasts with the active noise canceling (ANC) function built into earphones and similar devices. With ANC, the earphones or headphones sense the noise around the wearer and cancel it so that it does not reach the wearer's ears. NVIDIA focuses on suppressing the noise that the other party would hear.
Noise suppression during calls has made considerable progress in recent years, and calls carry far less noise than on the cell phones of 10 years ago. Recent smartphones have two microphones, shown in yellow in the image below, or even more.
One microphone is generally placed where the voice from the user's mouth reaches it best, and the others are placed as far from that microphone as possible, where ambient sound reaches them well. The mouth-side microphone mainly picks up the voice while the rear microphone picks up ambient sound, and software inside the device cancels out the ambient sound to produce a clean signal.
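The two-microphone idea can be illustrated with a deliberately simplified sketch. It assumes the rear microphone captures only ambient sound, that this sound leaks into the mouth-side microphone scaled by a single unknown gain, and that speech is uncorrelated with the noise. Real devices use adaptive filters and must handle delay and reverberation, and the function names and toy signals here are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sample lists."""
    return sum(x * y for x, y in zip(a, b))


def cancel_with_reference(primary, reference):
    """Remove the part of `primary` that is correlated with `reference`."""
    # Least-squares estimate of how strongly the reference noise leaks in
    gain = dot(primary, reference) / dot(reference, reference)
    cleaned = [p - gain * r for p, r in zip(primary, reference)]
    return cleaned, gain


# Toy signals (assumed values, not measured data)
speech = [1.0, -1.0, 1.0, -1.0]
ambient = [1.0, 1.0, 1.0, 1.0]  # what the rear microphone hears
mouth_mic = [s + 0.5 * n for s, n in zip(speech, ambient)]

cleaned, gain = cancel_with_reference(mouth_mic, ambient)
```

In this toy case the speech is exactly uncorrelated with the ambient signal, so the leak gain is recovered and the speech comes back unchanged. The approach depends on the second microphone providing a usable noise reference.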
However, this technology has drawbacks: "on devices such as smart watches and small microphones, two or more microphones cannot be placed far enough apart, and it does not work well if the user is far from the device or the device is shaking." Mounting multiple microphones also raises the manufacturer's costs. So NVIDIA decided to build a mechanism that performs noise cancellation with a single microphone instead of several.
Digital signal processing (DSP) algorithms are often used in noise cancellation to remove only the noise heard in the background of speech. These DSP algorithms work well for cutting continuous, steady noise, but they cannot handle short, sudden noise such as a baby crying or a siren. The 2 Hz team decided to use deep learning to deal with the noise that conventional noise cancellation handles poorly.
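The article does not name a specific DSP algorithm. As an illustration of why steady noise is easier to remove, here is a minimal sketch of spectral subtraction, a classical technique chosen here as an assumption. It works on toy per-frequency magnitudes rather than real audio, and all names and numbers are hypothetical.

```python
def estimate_noise_floor(noise_frames):
    """Average magnitude per frequency bin over frames known to hold only noise."""
    n_bins = len(noise_frames[0])
    return [
        sum(frame[k] for frame in noise_frames) / len(noise_frames)
        for k in range(n_bins)
    ]


def spectral_subtract(frame, noise_floor):
    """Subtract the estimated noise floor from each bin, clamping at zero."""
    return [max(m - n, 0.0) for m, n in zip(frame, noise_floor)]


# Toy magnitude spectra with three frequency bins (assumed values)
noise_only = [[0.5, 1.0, 1.5], [1.5, 1.0, 0.5]]
floor = estimate_noise_floor(noise_only)

steady_frame = [1.0, 5.0, 1.0]  # speech in bin 1 plus steady noise
burst_frame = [1.0, 5.0, 7.0]   # same, plus a sudden burst in bin 2

steady_out = spectral_subtract(steady_frame, floor)
burst_out = spectral_subtract(burst_frame, floor)
```

The steady noise is removed from the first frame, but most of the burst in the second frame survives, because the noise estimate was built from earlier frames that did not contain it.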
To build a noise canceling system with deep learning, two types of audio data are prepared, noise and clean speech, and they are mixed to create "noisy speech". The clean speech and the artificially noisy speech are then fed to a deep neural network (DNN), which is trained to output the clean speech from the noisy input. In this way the network learns to produce a mask that extracts the clean speech. In the 2 Hz project, the team developed its own DNN architecture and was able to create masks that handle a wide variety of noise.
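The data preparation step can be sketched as follows. This is a minimal illustration, not the 2 Hz pipeline: the target signal-to-noise ratio, the synthetic signals, and the use of an "ideal ratio mask" as the training target are all assumptions, and the DNN itself is omitted.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise level ratio equals `snr_db`, then add."""
    gain = rms(clean) / (rms(noise) * 10 ** (snr_db / 20))
    scaled = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled


def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-12):
    """Per-bin value in [0, 1]: the share of energy that belongs to speech."""
    return [c * c / (c * c + n * n + eps) for c, n in zip(clean_mag, noise_mag)]


# Synthetic stand-ins for real recordings (assumed, 8 kHz sample rate)
random.seed(0)
clean = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [random.gauss(0.0, 1.0) for _ in range(8000)]

noisy, scaled_noise = mix_at_snr(clean, noise, snr_db=5.0)
measured_snr_db = 20 * math.log10(rms(clean) / rms(scaled_noise))

# Toy per-frequency magnitudes: the mask keeps speech-dominated bins
mask = ideal_ratio_mask([1.0, 0.0, 3.0], [0.0, 2.0, 4.0])
```

A network trained on pairs like `noisy` and `clean` would learn to predict such a mask from the noisy input alone. Multiplying the noisy spectrum by the mask then attenuates the bins dominated by noise.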
Another problem with using noise cancellation on voice calls is audio delay. People can tolerate a delay of up to 0.2 seconds in a real-time conversation, but beyond that the speakers start talking over each other and a smooth conversation becomes impossible. Three factors affect call delay: the network, computation, and the codec. Usually the condition of the network has the largest effect. With DNN-based noise cancellation, however, the computation time could become long enough to interfere with the call.
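The delay constraint amounts to a simple budget. The sketch below uses the 0.2 second limit from the text; the individual network, codec, and compute figures are hypothetical values chosen only to show the arithmetic.

```python
MAX_CONVERSATIONAL_DELAY_MS = 200.0  # the 0.2 s limit cited in the text


def total_delay_ms(network_ms, codec_ms, compute_ms):
    """Total delay as the sum of the three contributing factors."""
    return network_ms + codec_ms + compute_ms


def compute_budget_ms(network_ms, codec_ms):
    """Time left for noise-suppression compute before the limit is exceeded."""
    return MAX_CONVERSATIONAL_DELAY_MS - network_ms - codec_ms


# Hypothetical figures for one call
budget = compute_budget_ms(network_ms=120.0, codec_ms=30.0)
fast_enough = total_delay_ms(120.0, 30.0, 40.0) <= MAX_CONVERSATIONAL_DELAY_MS
too_slow = total_delay_ms(120.0, 30.0, 70.0) <= MAX_CONVERSATIONAL_DELAY_MS
```

With these assumed numbers the network and codec leave 50 ms for noise suppression, so a model that needs 40 ms fits and one that needs 70 ms does not.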
Supporting high-quality noise canceling calls requires more processing power. However, it is not realistic to put a high-spec computer for noise cancellation into devices used for calls, such as smartphones. NVIDIA therefore arrived at the idea of moving the noise canceling mechanism to the cloud. Because the noise canceling system is software-based, it does not have to run on the local device itself.
Large VoIP providers need to process a large number of concurrent calls. One VoIP provider processes 3000 concurrent calls using the G.711 audio codec on a single bare metal media server. If adding noise cancellation to a VoIP call system makes server-side processing slow and heavy, service quality suffers, and users will not want to use noise canceling.
At first, 2 Hz tried running the processing on CPUs, but the results were not cost-effective. The team then experimented with noise-canceled VoIP processing on a "GTX 1080 Ti" GPU. As a result, they were able to handle 1000 simultaneous calls without optimizing the server, and found that with optimization it could handle 3000 simultaneous calls. Basic processing such as voice transmission and coding is performed by the CPU, while the GPU specializes in batch processing of noise cancellation, so noise can be suppressed without affecting standard VoIP processing.
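The division of labor described here follows a common batching pattern: gather one audio frame from every active call, process them all in a single call to the model, and hand the results back per call. The sketch below shows only that pattern. The `denoise_batch_stub` function is a hypothetical stand-in for a batched DNN running on a GPU, and the call IDs and frames are made up.

```python
from typing import Callable, Dict, List

Frame = List[float]


def denoise_batch_stub(batch: List[Frame]) -> List[Frame]:
    """Hypothetical stand-in for a batched GPU model: halves every sample."""
    return [[0.5 * s for s in frame] for frame in batch]


def process_tick(
    frames_by_call: Dict[str, Frame],
    model: Callable[[List[Frame]], List[Frame]],
) -> Dict[str, Frame]:
    """Batch one frame per active call, run the model once, scatter results."""
    call_ids = list(frames_by_call)                 # fix an ordering of the calls
    batch = [frames_by_call[c] for c in call_ids]   # gather frames into one batch
    cleaned = model(batch)                          # one model call for all calls
    return dict(zip(call_ids, cleaned))             # map results back to call IDs


incoming = {"call-a": [1.0, 2.0], "call-b": [4.0, -2.0]}
outgoing = process_tick(incoming, denoise_batch_stub)
```

Processing many calls per model invocation is what lets a parallel processor amortize its overhead, while the per-call bookkeeping around it stays on the CPU.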
GPUs are good at large-scale parallel processing because 3D graphics demands it, and batch processing for deep learning appears to be another area where GPUs excel. NVIDIA therefore says that GPUs are well suited to batch processing of noise cancellation using deep learning.