AI system 'Imagen' that can automatically generate high-precision images even from outlandish texts

' Imagen ', an AI model that can naturally generate high-resolution images even from outlandish text, has been announced. It seems that increasing the size of the language model will greatly improve both the fidelity of the sample and the integrity of the image and the text.

Imagen: Text-to-Image Diffusion Models

The mechanism of 'Imagen' is as follows. First , embed words by using the text encoder T5-XXL. Then convert the text to a 64x64 pixel image using the diffusion model . Furthermore, by applying this to the diffusion model for high resolution twice, it is possible to finally generate a high resolution image of 1024 × 1024 pixels.

What kind of image can actually be generated from what kind of text Imagen is as follows.

'A raccoon wearing an astronaut's helmet looks at the night view from the window.'

'A couple of robots eating at a fine restaurant with the Eiffel Tower in the background'

'Photo of Corgi wearing sunglasses and a beach hat riding a bicycle in Times Square'

'Google Brain logo and Toronto skyline written in fireworks'

'Corgi in the sushi house'

In addition, there is also an area on Imagen's official page where you can change the generated image by clicking the condition.

Fréchet Inception Distance (FID), an indicator that evaluates the quality of images generated by GAN, is a trained on COCO trained with COCO in an image dataset and images without COCO for training. The following figures compare the scores of the generative model (Not trained on COCO). The lower the FID, the higher the quality of the generated image, and Imagen has the highest score of '7.27'.

In addition, using 'DrawBench', a benchmark using a human evaluator to evaluate the accuracy of the model that converts text to images, Imagen and VQ-GAN + CLIP, Latent Diffusion Models (LDM),

DALL-E 2 The graph below compares the three models. The scores are divided into 'Alignment' and 'Fidelity', with Imagen winning all models on either scale.

A research paper on Imagen has also been published and can be checked from the following.

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
(PDF file)

in Software, Posted by logu_ii