Introducing WALT, a diffusion model that generates photorealistic videos from simple text
A research team from Stanford University and Google announced WALT , a diffusion model that generates photorealistic videos from text. Many videos actually generated using 'WALT' have been released.
WALTpdf
https://walt-video-diffusion.github.io/assets/WALTpdf
Photorealistic Video Generation with Diffusion Models
https://walt-video-diffusion.github.io/
'WALT' is a video generation AI based on the deep learning model Transformer announced by Google and others. Mr. Agrim Gupta of the research team mentioned the mechanism of WALT in a post on X (formerly Twitter).
We introduce WALT, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space. ???????? pic.twitter.com/uJKMtMsumv
— Agrim Gupta (@agrimgupta92) December 11, 2023
WALT first uses a causal 3D encoder to compress images and videos in a shared latent space .
2/ website: https://t.co/atH5wzRudu
— Agrim Gupta (@agrimgupta92) December 11, 2023
Our approach has two key design decisions. First, we use a causal encoder to compress images and videos in a shared latent space. pic.twitter.com/5YlLU2NaHa
The team then uses a windowed attention architecture tailored for spatial and temporal co-generative modeling in latent space to improve memory and training efficiency.
3/ Second, for memory and training efficiency, we use a window spatial attention based transformer architecture for joint and temporal generative modeling in latent space. pic.twitter.com/0uxVdRqlPL
— Agrim Gupta (@agrimgupta92) December 11, 2023
This allows us to generate photorealistic and temporally consistent motion from natural language prompts.
4/ Our model can generate photorealistic, temporally consistent motion from natural language prompts. pic.twitter.com/emH6nb8gkm
— Agrim Gupta (@agrimgupta92) December 11, 2023
In fact, the research team has published many examples of videos generated using WALT. Below is an example.
Video of ``Raccoon wearing a black jacket dancing slowly in front of the pyramid'' made with the AI model ``WALT'' that generates videos from text - YouTube
Video of ``Aerial photography of a beautiful castle surrounded by water'' made with the AI model ``WALT'' that generates videos from text - YouTube
Video of ``Dog wearing VR goggles at dusk'' made with the AI model ``WALT'' that generates videos from text - YouTube
Video of ``astronaut riding a horse'' made with the AI model ``WALT'' that generates videos from text - YouTube
Video of ``Elephant walking on the beach wearing a birthday hat'' made with the AI model ``WALT'' that generates videos from text - YouTube
Other videos published by the research team can be viewed on the following webpage.
Photorealistic Video Generation with Diffusion Models
https://walt-video-diffusion.github.io/samples.html
Related Posts: