Google DeepMind announces 'AI that generates music perfectly suited to videos'



Video generation AI is advancing rapidly, following image and text generation AI, but AI-generated videos to date have either been silent or had sound added by humans. On June 17, 2024, Google DeepMind announced 'video-to-audio (V2A),' a technology that generates music and sound to match the atmosphere and movement of a video.

Generating audio for video - Google DeepMind

https://deepmind.google/discover/blog/generating-audio-for-video/

The V2A system announced by Google DeepMind can generate dramatic background music, realistic sound effects, character dialogue, and more when combined with the video generation AI 'Veo.'

For example, the music and sounds in the following video were generated from the prompt 'Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete'.

V2A Horror - YouTube


In a scene where a person walks from the foreground into the background, unsettling background music and the sound of crunching footsteps can be heard.



When the scene changes and a figure appears, a heavy 'buzzing' sound plays.



Many other samples are available. The audio prompt for the following video is 'Cute baby dinosaur chirps, jungle ambience, egg cracking.'

V2A Dinosaur - YouTube


'Jellyfish pulsating under water, marine life, ocean'

V2A Jellyfish - YouTube


'A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd'

V2A Drums - YouTube


'Cars skidding, car engine throttling, angelic electronic music'

V2A Cars - YouTube


'A slow mellow harmonica plays as the sun goes down on the prairie.'

V2A Cowboy - YouTube


'Wolf howling at the moon'

V2A Wolf - YouTube


The V2A system first encodes the input video, then uses a diffusion model to iteratively refine audio from random noise. Once realistic audio that matches the video and prompts has been generated, the system decodes it and combines the audio data with the video.
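
The flow described above (encode the video, start from random noise, refine step by step, decode) can be illustrated with a toy sketch. This is not Google DeepMind's implementation: every function name here (`encode_video`, `denoise_step`, `generate_audio_latent`, `decode_audio`) is hypothetical, and the hand-written update rule stands in for what would be a trained neural network in a real diffusion model. It only shows the shape of an iterative refinement loop guided by a conditioning signal.

```python
import random

def encode_video(frames):
    """Toy stand-in for a video encoder: one number (mean brightness) per frame."""
    return [sum(frame) / len(frame) for frame in frames]

def denoise_step(latent, condition, strength):
    """Toy refinement step: move each value a fraction of the way toward the
    conditioning signal. A real diffusion model would use a learned network here."""
    return [x + strength * (c - x) for x, c in zip(latent, condition)]

def generate_audio_latent(condition, steps=50, strength=0.2, seed=0):
    """Start from random noise and refine it repeatedly, guided by the condition."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in condition]  # pure noise
    for _ in range(steps):
        latent = denoise_step(latent, condition, strength)
    return latent

def decode_audio(latent, samples_per_frame=4):
    """Toy decoder: hold each latent value for a few output samples."""
    return [x for x in latent for _ in range(samples_per_frame)]

# Three tiny "frames", each a list of pixel brightness values (assumed data).
frames = [[0.1, 0.2, 0.3], [0.5, 0.5, 0.5], [0.9, 0.8, 1.0]]
condition = encode_video(frames)
latent = generate_audio_latent(condition)
waveform = decode_audio(latent)
print(len(waveform), "samples generated")
```

In this sketch the noise shrinks by a fixed factor at every step, so the result converges on the conditioning signal. An actual system would also take a text prompt as a second conditioning input and would produce a real audio waveform rather than a list of held values.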



Because the V2A system can understand video, inputting text prompts is optional. For example, the guitar sound in the video below was synthesized without any prompting.

V2A Guitar - YouTube


Although the results still tend to look unnatural, some degree of lip syncing is possible. For example, the line spoken by the character in the video below was synthesized from the transcript 'this turkey looks amazing, I'm so hungry.'

V2A Claymation family - YouTube


V2A is not limited to videos generated by Veo. Google DeepMind said, 'It can also generate sound for a variety of existing footage, such as archival material or silent films, opening up a wider range of creative opportunities.'

in Software, Video, Posted by log1l_ks