Jun 18, 2024 11:10:00

Google DeepMind unveils AI that generates music perfectly suited to videos

Following advances in image and text generation AI, video generation AI is also progressing rapidly. However, AI-generated videos to date have either been silent or had sound added by humans. On June 17, 2024, Google DeepMind announced 'video-to-audio (V2A),' a system that generates music and sound matched to the atmosphere and movement of a video.

Generating audio for video - Google DeepMind

https://deepmind.google/discover/blog/generating-audio-for-video/

The V2A system announced by Google DeepMind can generate dramatic background music, realistic sound effects, character dialogue, and more when combined with the video generation AI 'Veo.'

For example, the music and sound in the following video were generated with the prompt 'Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete.'

V2A Horror - YouTube

In the scene where a character walks away from the camera, unsettling background music plays and crunching footsteps can be heard.

When the scene changes and a figure appears, a heavy 'buzzing' sound is heard.

Many other samples are available. The audio prompt for the video below is 'Cute baby dinosaur chirps, jungle ambience, egg cracking.'

V2A Dinosaur - YouTube

'Jellyfish pulsating underwater, marine life, ocean'

V2A Jellyfish - YouTube

'A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd'

V2A Drums - YouTube

'Cars skidding, car engine throttling, angelic electronic music'

V2A Cars - YouTube

'A slow mellow harmonica plays as the sun goes down on the prairie.'

V2A Cowboy - YouTube

'Wolf howling at the moon'

V2A Wolf - YouTube

The V2A system first encodes the input video into a compressed representation. A diffusion model then iteratively refines audio from random noise, guided by the video and any text prompts. Once realistic audio matching the video and prompts has been generated, it is decoded into a waveform and combined with the video.
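The flow described above (encode the video, start from random noise, refine step by step, then combine with the video) can be illustrated with a toy Python sketch. This is not Google DeepMind's implementation: every function name here is hypothetical, the 'encoder' simply averages pixel values, and the 'denoiser' is plain arithmetic standing in for a trained neural network.

```python
import random

def encode_video(frames):
    # Hypothetical stand-in for a learned video encoder:
    # compress each frame (a list of pixel values) to its mean brightness.
    return [sum(frame) / len(frame) for frame in frames]

def denoise_step(audio, condition, strength):
    # Stand-in for one pass of a trained denoising network: nudge every
    # sample a fraction of the way toward the conditioning signal.
    return [a + strength * (c - a) for a, c in zip(audio, condition)]

def generate_audio(frames, steps=50, strength=0.2, seed=0):
    rng = random.Random(seed)
    condition = encode_video(frames)
    audio = [rng.gauss(0.0, 1.0) for _ in condition]  # pure random noise
    for _ in range(steps):  # refine the noise a little on every step
        audio = denoise_step(audio, condition, strength)
    return audio

def combine(frames, audio):
    # Pair each frame with its generated audio sample.
    return list(zip(frames, audio))

# Three tiny two-pixel "frames" used purely for illustration.
frames = [[0.0, 1.0], [1.0, 1.0], [0.2, 0.4]]
audio = generate_audio(frames)
clip = combine(frames, audio)
```

The point of the sketch is only the shape of the loop: the output begins as noise and is pulled toward something determined by the video over many small steps. In a real diffusion model, a text prompt would act as an additional conditioning input alongside the video.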

Because the V2A system can understand the video itself, text prompts are optional. For example, the guitar sound in the video below was synthesized without any prompt.

V2A Guitar - YouTube

Although the results still often sound unnatural, some degree of lip syncing is possible. For example, the line spoken by the character in the video below was synthesized from the transcript 'This turkey looks amazing, I'm so hungry.'

V2A Claymation family - YouTube

Audio can be added to more than just videos generated by Veo. Google DeepMind said, 'It can also generate sound for a variety of existing footage, such as archival material or silent films, opening up a wider range of creative opportunities.'

Jun 18, 2024 11:10:00 in AI, Video, Software, Posted by log1l_ks