Image generation AI 'Stable Diffusion' developer announces 'DeepFloyd IF' that can generate images from natural sentences



Stability AI, the developer of 'Stable Diffusion', an AI that generates images from text prompts, has released a new image generation AI called 'DeepFloyd IF'. Among other improvements, it is better at rendering correct, legible text inside images.

DeepFloyd IF — DeepFloyd
https://deepfloyd.ai/deepfloyd-if

Stability AI announces 'DeepFloyd IF', a high-performance text-to-image conversion model that incorporates a large-scale language model
https://ja.stability.ai/blog/deepfloyd-if-text-to-image-model

A demo page for DeepFloyd IF has been published, so I tried it out. First, enter a prompt and click 'Generate'. This time, I entered the Japanese equivalent of "a koala wearing clothes with the words 'good night' written on the belly" as the prompt and left the Negative Prompt blank.



The result was an image that appeared unrelated to the prompt. Any prompt entered in Japanese produced an image like this, so at the time of writing it is best not to enter prompts in Japanese.



I tried again with the prompt in English, and this is the result. Four candidate images are displayed, all at low resolution, so the next step is to upscale one of them.



Select one image you like and click 'Upscale'.



Then, the upscaled image is displayed like this.



The image below shows the generation flow of DeepFloyd IF. The input prompt is converted into a text representation by the frozen T5-XXL language model, and that representation is then turned into a 64x64 image by the base model, which comes in three sizes: IF-I 400M, IF-I 900M, and IF-I 4.3B.



In the second stage, a super-resolution model based on 'Efficient U-Net', available as IF-II 450M or IF-II 1.2B, is applied to the output of the base model to upscale the 64x64 image to 256x256. In the third stage, a second super-resolution model is applied to produce a sharp 1024x1024 image.
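The three stages form a cascade in which each step enlarges the previous output. The following is a minimal sketch of that arithmetic only, not DeepFloyd's code: the `Stage` class, the `CASCADE` list, and both helper functions are hypothetical names introduced for illustration, and only the stage order and resolutions come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the cascade (hypothetical helper, for illustration only)."""
    name: str
    out_size: int  # the stage outputs an out_size x out_size image


# Stage order and resolutions as described in the article.
CASCADE = [
    Stage("stage 1: base model (IF-I)", 64),
    Stage("stage 2: super-resolution (IF-II)", 256),
    Stage("stage 3: super-resolution", 1024),
]


def upscale_factors(stages):
    """Linear upscale factor between each pair of consecutive stages."""
    return [b.out_size // a.out_size for a, b in zip(stages, stages[1:])]


def pixel_counts(stages):
    """Total number of pixels each stage outputs."""
    return [s.out_size ** 2 for s in stages]


if __name__ == "__main__":
    for stage, pixels in zip(CASCADE, pixel_counts(CASCADE)):
        print(f"{stage.name}: {stage.out_size}x{stage.out_size} ({pixels} pixels)")
    print("upscale factors:", upscale_factors(CASCADE))
```

Each super-resolution step enlarges the image by a factor of 4 per side, so the final image has 256 times as many pixels as the 64x64 output of the base model.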

DeepFloyd IF was trained on the LAION-A dataset. LAION-A was derived from the LAION-5B dataset through similarity-hash-based deduplication, cleaning, and other modifications to the original dataset, and DeepFloyd's custom filters were used to remove watermarked, NSFW, and other inappropriate content.
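The article does not say which similarity hash or threshold was used to build LAION-A. As a rough illustration of the general idea, the sketch below uses a simple average hash as a stand-in: each image is reduced to a bit string, and an image is dropped when its hash is within a small Hamming distance of one already kept. All function names, the `max_distance` parameter, and the tiny 2x2 sample "images" are assumptions made for this example.

```python
def average_hash(pixels):
    """Hash a grayscale image (2D list of 0-255 values) into an integer.

    Each pixel contributes one bit: 1 if brighter than the image mean, else 0.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(images, max_distance=1):
    """Keep the first image of every group of near-duplicates.

    images: list of (name, pixels) pairs. max_distance is an assumed
    threshold; hashes at or below this distance count as duplicates.
    """
    kept = []  # (name, hash) pairs of images kept so far
    for name, pixels in images:
        h = average_hash(pixels)
        if all(hamming(h, kept_hash) > max_distance for _, kept_hash in kept):
            kept.append((name, h))
    return [name for name, _ in kept]


# Toy data: "a_brighter" is the same pattern as "a" with a brightness shift.
SAMPLES = [
    ("a", [[10, 200], [200, 10]]),
    ("a_brighter", [[20, 210], [210, 20]]),
    ("b", [[200, 10], [10, 200]]),
]

if __name__ == "__main__":
    print(deduplicate(SAMPLES))
```

In this toy data, the brightness-shifted copy hashes to the same value as the original and is dropped, while the inverted pattern is kept. A real pipeline would hash downscaled versions of full images and would need an index to avoid comparing every pair.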

DeepFloyd IF is good at rendering text, which other models struggle with, and can place text correctly in the image. The video below animates images generated by DeepFloyd IF that contain the lyrics of a song. In multiple scenes, the lyrics appear in the image exactly as written.

Lyric video, but it's AI Generated (The Smiths - There Is a Light That Never Goes Out)-YouTube


The images below compare Stable Diffusion 2.1 and DeepFloyd IF side by side, with both images generated from the same prompt.



Next is a comparison with Imagen.



Muse



eDiff-I



Parti



This is a comparison with DALL·E 2.



in Software, Art, Posted by log1p_kr