Dec 12, 2023 16:00:00

Introducing WALT, a diffusion model that generates photorealistic videos from simple text

A research team from Stanford University and Google has announced WALT, a diffusion model that generates photorealistic videos from text prompts. The team has also released many videos generated with WALT.

WALTpdf
https://walt-video-diffusion.github.io/assets/WALTpdf

Photorealistic Video Generation with Diffusion Models
https://walt-video-diffusion.github.io/

WALT is a video generation AI built on the Transformer, the deep learning architecture introduced by Google and others. Agrim Gupta, a member of the research team, outlined how WALT works in a thread on X (formerly Twitter).

We introduce WALT, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space. pic.twitter.com/uJKMtMsumv
— Agrim Gupta (@agrimgupta92) December 11, 2023

WALT first uses a causal 3D encoder to compress images and videos into a shared latent space.

2/ website: https://t.co/atH5wzRudu

Our approach has two key design decisions. First, we use a causal encoder to compress images and videos in a shared latent space. pic.twitter.com/5YlLU2NaHa
— Agrim Gupta (@agrimgupta92) December 11, 2023
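
To illustrate why a causal encoder lets images and videos share one latent space, here is a minimal sketch in Python. It is not the team's implementation: the scalar per-frame features, the kernel weights, and the function name `causal_temporal_conv` are all hypothetical, and the sketch omits the spatial and temporal downsampling that a real encoder would perform. It only shows the causal part, in which each output frame depends on the current and earlier frames, so the first frame of a video is encoded exactly like a standalone image.

```python
def causal_temporal_conv(frames, kernel):
    """Convolve per-frame features along time, looking only at the past.

    frames: list of floats, one scalar per frame (a stand-in for a feature map).
    kernel: list of floats; kernel[-1] weights the current frame, and earlier
            entries weight earlier frames.
    """
    k = len(kernel)
    # Pad only on the past side by repeating the first frame, so the output
    # at time t never sees frames later than t.
    padded = [frames[0]] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


kernel = [0.25, 0.25, 0.5]    # hypothetical weights, chosen for illustration
video = [1.0, 2.0, 4.0, 8.0]  # four frames
image = video[:1]             # an image treated as a one-frame video

video_latents = causal_temporal_conv(video, kernel)
image_latents = causal_temporal_conv(image, kernel)
```

In this sketch `video_latents[0]` equals `image_latents[0]`, and changing a later frame leaves the earlier outputs unchanged. That property is what allows a single encoder to handle both still images and videos.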

The model then uses a window attention architecture tailored for joint spatial and temporal generative modeling in latent space, which improves memory and training efficiency.

3/ Second, for memory and training efficiency, we use a window attention based transformer architecture for joint spatial and temporal generative modeling in latent space. pic.twitter.com/0uxVdRqlPL
— Agrim Gupta (@agrimgupta92) December 11, 2023
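
The efficiency gain from window attention can be sketched by counting how many query-key pairs are scored. The Python below is an illustration under assumptions, not WALT's actual configuration: the latent grid size, the window shapes, and the helper names `partition_windows` and `attention_pairs` are hypothetical. It splits a grid of latent tokens into non-overlapping windows, once within single frames (spatial) and once across frames (spatiotemporal), and compares the cost with full attention over all tokens.

```python
from itertools import product


def partition_windows(grid, window):
    """Split a (T, H, W) token grid into non-overlapping (t, h, w) windows.

    Returns a list of windows, each a list of (frame, row, col) coordinates.
    Every grid dimension must be divisible by the matching window dimension.
    """
    T, H, W = grid
    t, h, w = window
    if T % t or H % h or W % w:
        raise ValueError("grid must be divisible by window size")
    windows = []
    for t0, h0, w0 in product(range(0, T, t), range(0, H, h), range(0, W, w)):
        windows.append([
            (t0 + dt, h0 + dh, w0 + dw)
            for dt, dh, dw in product(range(t), range(h), range(w))
        ])
    return windows


def attention_pairs(windows):
    """Count query-key pairs when each token attends only within its window."""
    return sum(len(win) ** 2 for win in windows)


grid = (4, 8, 8)  # hypothetical latent grid: 4 frames of 8x8 tokens

# Full attention is the special case of one window covering the whole grid.
full = attention_pairs(partition_windows(grid, grid))
# Spatial windows stay inside a single frame.
spatial = attention_pairs(partition_windows(grid, (1, 4, 4)))
# Spatiotemporal windows span all frames over a small spatial patch.
spatiotemporal = attention_pairs(partition_windows(grid, (4, 2, 2)))
```

With these sizes, full attention scores 65,536 pairs, while each windowed variant scores 4,096. The spatial windows mix information within a frame and the spatiotemporal windows mix it across frames, so using both kinds covers space and time at a fraction of the cost.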

This allows WALT to generate photorealistic, temporally consistent motion from natural language prompts.

4/ Our model can generate photorealistic, temporally consistent motion from natural language prompts. pic.twitter.com/emH6nb8gkm
— Agrim Gupta (@agrimgupta92) December 11, 2023

The research team has published many examples of videos generated with WALT. Some of them are listed below.

Video generated by WALT from the prompt "Raccoon wearing a black jacket dancing slowly in front of the pyramid" (YouTube)

Video generated by WALT from the prompt "Aerial photography of a beautiful castle surrounded by water" (YouTube)

Video generated by WALT from the prompt "Dog wearing VR goggles at dusk" (YouTube)

Video generated by WALT from the prompt "Astronaut riding a horse" (YouTube)

Video generated by WALT from the prompt "Elephant walking on the beach wearing a birthday hat" (YouTube)

Other videos published by the research team can be viewed on the following webpage.

Photorealistic Video Generation with Diffusion Models
https://walt-video-diffusion.github.io/samples.html