The image generation AI "Stable Diffusion" turns out to be capable of excellent image compression



Stable Diffusion, which was released to the public in August 2022, is an AI that automatically generates images from the text you enter. Software engineer Matthias Bühlmann explains how Stable Diffusion can be used not only as an image generation AI but also as a powerful lossy image compression codec.

Stable Diffusion-based Image Compression | by Matthias Bühlmann | Sep, 2022 | Medium

https://matthias-buehlmann.medium.com/stable-diffusion-based-image-compression-6f1f0a399202

The following are 512 × 512 pixel images of the city of San Francisco, each compressed in a different way. The first is JPEG format, the second is WebP format, and the third is Stable Diffusion compression.





Candy store showcase. The first is JPEG format, the second is WebP format, and the third is Stable Diffusion compression.





Alpaca face. The first is JPEG format, the second is WebP format, and the third is Stable Diffusion compression.





In all three examples, compression using Stable Diffusion shows the least noise and produces the smallest file size.

When Stable Diffusion generates images, a Variational Autoencoder (VAE) encodes images from image space into a latent space and decodes them back again. A latent space representation is a low-resolution (64 × 64 pixels), high-precision (4 × 32 bits) representation of a source image (512 × 512 pixels at 3 × 8 or 4 × 8 bits).
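
To put these figures side by side, the raw storage of the two representations can be worked out directly. The snippet below is only a back-of-the-envelope sketch; the helper name `raw_size_bits` is made up for illustration and is not part of Stable Diffusion.

```python
def raw_size_bits(width, height, channels, bits_per_channel):
    """Uncompressed size in bits of a width x height grid of samples."""
    return width * height * channels * bits_per_channel


# Source image: 512 x 512 pixels, 3 channels of 8 bits (24 bpp).
source_bits = raw_size_bits(512, 512, 3, 8)

# Latent representation: 64 x 64 positions, 4 channels of 32 bits (128 bpp).
latent_bits = raw_size_bits(64, 64, 4, 32)

print(source_bits // 8, "bytes for the source image")   # 786432
print(latent_bits // 8, "bytes for the latent")         # 65536
print(source_bits / latent_bits, "times smaller")       # 12.0
```

Even before any quantization, the latent is already 12 times smaller than the uncompressed 24 bpp source.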

For example, the following 512 × 512 pixel 24 bpp image ......



Encoding to a 64 × 64 pixel 128 bpp image with VAE looks like this.



And the following is the encoded image decoded into a 512 x 512 pixel, 24 bpp image. At first glance, it looks like it has just returned to its original state, but in fact there is a slight loss, such as the letters written on the alpaca's collar becoming somewhat difficult to read.



To use Stable Diffusion as an image compression codec, Bühlmann examined how efficiently the latent image representation generated by the VAE can be compressed. Downsampling the latent image or applying an existing lossy image compression method to it caused significant deterioration in the reconstructed image. VAE decoding, on the other hand, proved to be very robust against quantization of the latent image.

Quantizing the latent image to 8 bits per channel gives a data size of 64 × 64 × 4 × 8 bits = 131,072 bits, or about 16.4 kB. Converting the latent to a paletted representation with 256 entries, using Floyd-Steinberg dithering, reduces the data size to 64 × 64 × 8 bits + 256 × 4 × 8 bits = 40,960 bits, or about 5.12 kB.
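
The quantization step and the size arithmetic above can be illustrated with a few lines of Python. This is a minimal sketch, not Bühlmann's code: the article does not say how the value range is chosen, so the min/max range used here is an assumption, and the function names and sample values are invented for the example.

```python
def quantize_8bit(values, lo, hi):
    """Map floats in [lo, hi] to integer codes 0..255, clamping outliers."""
    scale = 255.0 / (hi - lo)
    return [min(255, max(0, round((v - lo) * scale))) for v in values]


def dequantize_8bit(codes, lo, hi):
    """Map integer codes 0..255 back to floats in [lo, hi]."""
    step = (hi - lo) / 255.0
    return [lo + c * step for c in codes]


# Made-up latent values; a real latent would have 64 * 64 * 4 of them.
latent = [-3.2, -0.5, 0.0, 1.25, 4.8]
lo, hi = min(latent), max(latent)  # assumed range, not from the article

codes = quantize_8bit(latent, lo, hi)
restored = dequantize_8bit(codes, lo, hi)
max_error = max(abs(a - b) for a, b in zip(latent, restored))

# Size arithmetic from the text.
quantized_bits = 64 * 64 * 4 * 8            # 8 bits for each of 4 channels
paletted_bits = 64 * 64 * 8 + 256 * 4 * 8   # 8-bit indices plus the palette

print(quantized_bits // 8, "bytes quantized")  # 16384
print(paletted_bits // 8, "bytes paletted")    # 5120
```

The round-trip error of such a uniform quantizer is at most half a quantization step, which is the kind of small perturbation the VAE decoder is reported to tolerate.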

Floyd-Steinberg dithering introduces noise into the paletted latent image, which distorts the decoded result. However, since Stable Diffusion works by removing noise from latent images, running a few denoising iterations brings the decoded result close to the original image.
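
For readers unfamiliar with Floyd-Steinberg dithering, the following is a minimal pure-Python sketch of the idea applied to a grid of multi-channel values: each position is replaced by its nearest palette entry, and the resulting error is pushed onto neighbors that have not been processed yet. The toy 4 × 4 "latent" and two-entry palette are invented for illustration; how the 256-entry palette is built is not described in the article and is not covered here.

```python
def nearest_index(vec, palette):
    """Index of the palette entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, palette[i])),
    )


def floyd_steinberg(image, palette):
    """Return a grid of palette indices for image (rows of equal-length vectors)."""
    height, width = len(image), len(image[0])
    work = [[list(px) for px in row] for row in image]  # copy that absorbs errors
    indices = [[0] * width for _ in range(height)]
    # (dx, dy, weight) entries of the classic Floyd-Steinberg kernel.
    kernel = ((1, 0, 7 / 16), (-1, 1, 3 / 16), (0, 1, 5 / 16), (1, 1, 1 / 16))
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            idx = nearest_index(old, palette)
            indices[y][x] = idx
            error = [o - p for o, p in zip(old, palette[idx])]
            # Spread the quantization error over not-yet-visited neighbors.
            for dx, dy, weight in kernel:
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and ny < height:
                    for c, e in enumerate(error):
                        work[ny][nx][c] += e * weight
    return indices


# Toy 4-channel "latent": a flat field halfway between the two palette entries.
image = [[(0.5, 0.5, 0.5, 0.5) for _ in range(4)] for _ in range(4)]
palette = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)]
indices = floyd_steinberg(image, palette)

for row in indices:
    print(row)
```

Because no palette entry matches the flat field exactly, the output alternates between the two entries so that the local average stays near the original value. That alternation is the dithering noise which the denoising iterations are then expected to clean up.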

However, compression with Stable Diffusion tends to alter the image content itself rather than degrade the image quality. Because version 1.4 of Stable Diffusion cannot preserve small text and faces well in the latent space, an image that looks clean at first glance may actually differ in content from the original. Bühlmann says that if this problem is addressed in version 1.5, Stable Diffusion will become even more useful as an image compression algorithm.

in Software, Posted by log1i_yk