An easy-to-understand illustration of how the image generation AI "Stable Diffusion" draws its pictures
The image generation AI "Stable Diffusion", released to the public free of charge in August 2022, lets anyone with a PC equipped with an NVIDIA GPU, or an online execution environment such as Google Colaboratory, generate images from arbitrary character strings. AI Pub, a Twitter account that explains AI topics, has explained how Stable Diffusion generates these images.
// Stable Diffusion, Explained //
You've seen the Stable Diffusion AI art all over Twitter.
But how does Stable Diffusion _work_?
A thread explaining diffusion models, latent space representations, and context injection:
1/15 pic.twitter.com/VX9UVmUaKJ
— AI Pub (@ai__pub) August 21, 2022
You can get a rough idea of how Stable Diffusion generates images from the GIF animation that appears when you click the image below.
The "Diffusion" in Stable Diffusion refers to the diffusion process: repeatedly adding small random noise to an image, proceeding from left to right in the image below. Stable Diffusion performs the reverse of this process, going from right to left, that is, transforming noise into an image.
First, a one-tweet summary of diffusion models (DMs).
Diffusion is the process of adding small, random noise to an image, repeatedly.
Diffusion models reverse this process, turning noise into images, bit-by-bit.
Photo credit: @AssemblyAI
2/15 pic.twitter.com/Jk34utlZxE
— AI Pub (@ai__pub) August 21, 2022
A trained neural network is in charge of this process of converting noise back into an image. What the network learns is a function f(x, t) that removes a little of the noise from x, producing what x would look like at time step t-1.
How do DMs turn noise into images?
By training a neural network to do so gradually.
With the sequence of noised images = x_1, x_2, ... x_T,
The neural net learns a function f(x,t) that denoises x 'a little bit', producing what x would look like at time step t-1.
3/15 pic.twitter.com/QZJX89X6Rk
— AI Pub (@ai__pub) August 21, 2022
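As a concrete illustration of this training setup, here is a minimal PyTorch sketch. The tiny fully connected "Denoiser", the linear noise schedule, and the random toy data are all illustrative assumptions; the real model is a much larger U-Net trained on images.

```python
# Minimal sketch of diffusion-model training (toy assumptions throughout).
import torch
import torch.nn as nn

T = 1000  # number of diffusion time steps
betas = torch.linspace(1e-4, 0.02, T)           # noise added at each step
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # total signal kept at step t

class Denoiser(nn.Module):
    """Toy stand-in for the U-Net: predicts the noise in x_t given t."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(),
                                 nn.Linear(128, dim))

    def forward(self, x, t):
        # Concatenate a normalized time step so f "knows" how noisy x is.
        t_feat = (t.float() / T).unsqueeze(-1)
        return self.net(torch.cat([x, t_feat], dim=-1))

model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x0 = torch.rand(16, 64)            # toy "clean images" (random stand-ins)
    t = torch.randint(0, T, (16,))     # a random time step per sample
    noise = torch.randn_like(x0)
    # Forward diffusion in closed form: x_t is a mix of x0 and noise.
    a = alphas_bar[t].unsqueeze(-1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise
    # The network learns to recover the noise, i.e. to denoise "a little bit".
    loss = ((model(xt, t) - noise) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```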
Applying this function many times turns pure noise into a clean image. Stable Diffusion's processing is therefore a nest of function applications, f(f(f(f(....f(N, T), T-1), T-2)..., 2, 1), where N is pure noise and T is the number of diffusion steps.
To turn pure noise into an HD image, just apply f several times!
The output of a diffusion model really is just
f(f(f(f(....f(N, T), T-1), T-2)..., 2, 1)
where N is pure noise, and T is the number of diffusion steps.
The neural net f is typically implemented as a U-net.
4/15 pic.twitter.com/lxYvucaCGt
— AI Pub (@ai__pub) August 21, 2022
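Written as code, the nested expression above is simply a loop. In this sketch, `denoise_step` is a hypothetical placeholder for one application of the trained network f, not Stable Diffusion's actual sampler.

```python
# The nested f(f(...f(N, T)...), 1) from the tweet, unrolled into a loop.
import torch

def denoise_step(x, t):
    # Placeholder: a real model would predict and remove a little noise here.
    return x * 0.99

T = 50                      # number of diffusion steps
x = torch.randn(1, 64)      # N: pure Gaussian noise
for t in range(T, 0, -1):   # t = T, T-1, ..., 1
    x = denoise_step(x, t)  # equivalent to one more nesting of f
# x is now the (toy) "image" that the nested expression would produce
```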
Of course, running this whole series of steps directly on 512 x 512 pixel images is computationally very expensive.
The key idea behind Stable Diffusion:
Training and computing a diffusion model on large 512 x 512 images is _incredibly_ slow and expensive.
Instead, let's do the computation on _embeddings_ of images, rather than on images themselves.
5/15 pic.twitter.com/h8qRyxeqY1
— AI Pub (@ai__pub) August 21, 2022
So instead of working in the actual pixel space, Stable Diffusion works in a lower-dimensional latent space to reduce this computational burden. Specifically, an encoder compresses the image x into a latent-space representation z(x), and the diffusion process and denoising U-Net operate on z(x) instead of x. In the figure below, ε is the encoder and D is the decoder.
So, stable diffusion works in two steps.
Step 1: Use an encoder to compress an image 'x' into a lower-dimensional, latent-space representation 'z(x)'
Step 2: run diffusion and denoising on z(x), rather than x.
Diagram below!
6/15 pic.twitter.com/SPHCyZDOSD
— AI Pub (@ai__pub) August 21, 2022
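The sketch below shows the shape of this two-step flow, using Stable Diffusion's 512x512x3 images and 64x64x4 latents. The single-layer "encoder" and "decoder" and the placeholder denoising loop are toy stand-ins for the real VAE and U-Net, chosen only to make the dimensions visible.

```python
# Two-step flow: compress with E, diffuse in latent space, decode with D.
import torch
import torch.nn as nn

# Toy stand-ins for the real autoencoder: one conv layer each for E and D.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)           # E: 512x512x3 -> 64x64x4
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)  # D: 64x64x4 -> 512x512x3

x = torch.randn(1, 3, 512, 512)   # the image x
z = encoder(x)                    # Step 1: latent representation z(x)
print(z.shape)                    # torch.Size([1, 4, 64, 64]) -- ~48x fewer values

# Step 2: run diffusion/denoising on z instead of x.
for t in range(50, 0, -1):
    z = z * 0.99                  # placeholder: the real U-Net step goes here

x_hat = decoder(z)                # D maps the denoised latent back to pixels
print(x_hat.shape)                # torch.Size([1, 3, 512, 512])
```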
The following article also explains how neural networks understand images.
How does the neural network understand images - GIGAZINE
Stable Diffusion can also take character strings (prompts) as inputs to this function. The prompt you enter steers the direction of the denoising to some extent.
But where does the text prompt come in?
I lied! SD does NOT learn a function f(x,t) to denoise x a 'little bit' back in time.
It actually learns a function f(x, t, y), with y the 'context' to guide the denoising of x.
Below, y is the image label 'arctic fox'.
8/15 pic.twitter.com/z4WVWJ8NVu
— AI Pub (@ai__pub) August 21, 2022
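To make the change of signature concrete, here is a toy sketch of a denoiser that also takes a context y. The `embed_label` helper is hypothetical: the real Stable Diffusion encodes the prompt with a CLIP text encoder, not a random embedding.

```python
# Sketch of the corrected signature f(x, t, y): context guides the denoising.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, dim=64, ctx_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + ctx_dim, 128), nn.ReLU(),
                                 nn.Linear(128, dim))

    def forward(self, x, t, y):
        # The time step t AND the context y jointly steer the denoising.
        t_feat = (t.float() / 1000).unsqueeze(-1)
        return self.net(torch.cat([x, t_feat, y], dim=-1))

def embed_label(label, ctx_dim=32):
    # Hypothetical toy embedding; real SD uses a CLIP text encoder here.
    torch.manual_seed(abs(hash(label)) % (2 ** 31))
    return torch.randn(1, ctx_dim)

f = ConditionedDenoiser()
x = torch.randn(1, 64)
y = embed_label("arctic fox")
out = f(x, torch.tensor([500]), y)   # one context-guided denoising step
```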
Stable Diffusion lets this 'context' from the prompt intervene in the denoising process in two ways: by simple concatenation and by cross-attention.
But how does SD process context?
The 'context' y, alongside the time step t, can be injected into the latent space representation z(x) either by:
1) Simple concatenation
2) Cross-attention
Stable diffusion uses both.
10/15 pic.twitter.com/WaDmy0RyIB
— AI Pub (@ai__pub) August 21, 2022
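Here is a minimal sketch of those two injection routes, with purely illustrative dimensions: conditioning tokens concatenated onto the latent sequence, and a cross-attention layer whose queries come from the latent and whose keys and values come from the context.

```python
# Two ways to inject context into the latent representation (toy dimensions).
import torch
import torch.nn as nn

d = 64                          # latent channel dim (illustrative)
z = torch.randn(1, 16, d)       # latent "pixels" treated as a token sequence
y = torch.randn(1, 8, d)        # context tokens (e.g. a prompt embedding)
t_emb = torch.randn(1, 1, d)    # time-step embedding

# 1) Simple concatenation: append conditioning tokens to the latent sequence.
z_cat = torch.cat([z, t_emb], dim=1)       # shape (1, 17, d)

# 2) Cross-attention: queries from the latent, keys/values from the context.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
z_attn, _ = attn(query=z, key=y, value=y)  # each latent token "reads" y
print(z_cat.shape, z_attn.shape)
```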
Another major feature of Stable Diffusion is that the context can be an image as well as a character string. This allows Stable Diffusion to perform image restoration (inpainting) and semantic image synthesis from image data.
The cool part not talked about on Twitter: the context mechanism is incredibly flexible.
Instead of y = an image label,
Let y = a masked image, or y = a scene segmentation.
SD trained on this different data, can now do image inpainting and semantic image synthesis!
11/15 pic.twitter.com/wzCF8OOV0p
— AI Pub (@ai__pub) August 21, 2022
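As a rough sketch of the inpainting case, the context y can be built from a masked image and its mask. Everything here, including the placeholder `denoise` function, is an illustrative assumption rather than Stable Diffusion's actual training setup.

```python
# Toy sketch: swap the context y from a text label to a masked image.
import torch

image = torch.rand(1, 3, 64, 64)
mask = torch.ones(1, 1, 64, 64)
mask[:, :, 24:40, 24:40] = 0.0          # the hole to be filled in
y = torch.cat([image * mask, mask], 1)  # context: masked image + its mask

def denoise(x, t, y):
    # Placeholder for f(x, t, y); a real model would use y to fill the hole.
    return x * 0.99

x = torch.randn(1, 3, 64, 64)           # start from pure noise
for t in range(50, 0, -1):
    x = denoise(x, t, y)                # the image context guides inpainting
```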
The following article also summarizes the mechanism of Stable Diffusion in detail.
What is the mechanism of image generation AI 'Stable Diffusion' that raises issues such as artist rights violations and pornography generation? - GIGAZINE
in Software, Posted by log1i_yk