What is the fatal flaw found in the image generation AI/Stable Diffusion encoder?



Stable Diffusion, an image generation AI, is an AI called 'latent diffusion model' that can generate highly accurate images just by inputting text. A report has been posted on the online bulletin board site Reddit that the ``VAE'' used in Stable Diffusion has a fatal flaw.

The VAE used for Stable Diffusion 1.x/2.x and other models (KL-F8) has a critical flaw, probably due to bad training, that is holding back all models that use it (almost certainly including DALL-E 3) : StableDiffusion

https://old.reddit.com/r/StableDiffusion/comments/1ag5h5s/the_vae_used_for_stable_diffusion_1x2x_and_other/

Stable Diffusion uses a technology called `` variational autoencoder (VAE)'' that encodes or decodes pixel space images into latent space. A latent space is a space in which the learned image features (data) are gathered, and VAE can capture the learned data features and convert them into latent variables, or generate data from the latent variables.

Stable Diffusion versions 1.x and 2.x use a VAE called 'KL-F8' by default. On the other hand, Stable Diffusion XL , which has further improved performance from Stable Diffusion, uses an encoder independently improved by developer StabilityAI. This time drhead's problem is KL-F8.

For example, encode the image below into latent space.



When encoded into latent space using Stable Diffusion XL's VAE, it looks like this.



Below is the same image encoded with KL-F8. If you look closely, you can see mysterious sunspots appearing in the image.



If you compare the original image (left) encoded into latent space with KL-F8 and then the decoded image (right), you can see that the area where the black dots occur is blurred, and the quality of the image has deteriorated. .



Drhead, the user who posted this issue on Reddit, said that the KL-F8 installed in Stable Diffusion has a feature that says, ``When encoding an image into latent space, a specific part of the image is displayed with the message ``This image is entirely green.'' It points out that the disadvantage is that 'it leaves out comprehensive information such as '.

Essentially, any point on an image in latent space is constructed in relation to nearby points. However, if the overall information is stored in only a small part, it will affect the noise removal during image generation and waste computational resources. Drhead speculates that the black dots that appeared in the image encoded in the latent space above are the points where this overall information is stored.

The KL-F8 issue affects versions 1.x and 2.x of Stable Diffusion, as well as the video generation AI Stable Video Diffusion, OpenAI's DALL-E 3, and other models based on Stable Diffusion. That's what he said. In addition, this problem does not seem to occur with Stable Diffusion XL, which has improved VAE.

Image generation works almost flawlessly even when using the KL-F8. However, drhead said that the problem was that the image raw calculation cost would increase, and argued that building new models using KL-F8 should be stopped.

In addition, drhead claims that the KL-F8 used in Stable Diffusion is a mistake by CompVis , a research group at the University of Munich that developed Stable Diffusion.

in Software, Posted by log1i_yk