
Model Card for flex-diffusion-2-1

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.

TLDR:

There are 2 models in this repo:

  1. A UNet finetuned from stabilityai/stable-diffusion-2-1 (in the 2-1 subfolder).
  2. A UNet finetuned from stabilityai/stable-diffusion-2-base (in the 2-base subfolder).

For usage, see How to Get Started with the Model below.

It aims to solve the following issues:

  1. Generated images look like they are cropped from a larger image.

  2. Generating non-square images creates weird results, due to the model being trained on square images. Examples:

| resolution | model | stable diffusion | flex diffusion |
| --- | --- | --- | --- |
| 576x1024 (9:16) | v2-1 | img | img |
| 576x1024 (9:16) | v2-base | img | img |
| 1024x576 (16:9) | v2-1 | img | img |
| 1024x576 (16:9) | v2-base | img | img |

Limitations:

  1. It's trained on a small dataset, so its improvements may be limited.
  2. For each aspect ratio, it's trained at only one fixed resolution, so it may not be able to generate images at other resolutions. For the 1:1 aspect ratio, it's fine-tuned at 512x512, even though stable-diffusion-2-1 itself was last finetuned at 768x768.

Potential improvements:

  1. Train on a larger dataset.
  2. Train on different resolutions even for the same aspect ratio.
  3. Train on specific aspect ratios, instead of a range of aspect ratios.


Model Details

Model Description

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.

finetuned resolutions:

| width | height | aspect ratio |
| --- | --- | --- |
| 512 | 1024 | 1:2 |
| 576 | 1024 | 9:16 |
| 576 | 960 | 3:5 |
| 640 | 1024 | 5:8 |
| 512 | 768 | 2:3 |
| 640 | 896 | 5:7 |
| 576 | 768 | 3:4 |
| 512 | 640 | 4:5 |
| 640 | 768 | 5:6 |
| 640 | 704 | 10:11 |
| 512 | 512 | 1:1 |
| 704 | 640 | 11:10 |
| 768 | 640 | 6:5 |
| 640 | 512 | 5:4 |
| 768 | 576 | 4:3 |
| 896 | 640 | 7:5 |
| 768 | 512 | 3:2 |
| 1024 | 640 | 8:5 |
| 960 | 576 | 5:3 |
| 1024 | 576 | 16:9 |
| 1024 | 512 | 2:1 |
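
Since each aspect ratio was finetuned at exactly one resolution (see Limitations), it may help to snap a requested size to the nearest entry in this table. The helper below is only a minimal sketch under that assumption; the name `nearest_finetuned_size` and the log-ratio selection rule are mine, not part of the repo:

```python
# Minimal sketch: pick the finetuned (width, height) whose aspect ratio is
# closest to a requested size. The log-ratio distance is an assumption.
import math

FINETUNED_SIZES = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]

def nearest_finetuned_size(width: int, height: int) -> tuple[int, int]:
    """Return the finetuned resolution whose aspect ratio best matches width/height."""
    target = math.log(width / height)
    return min(FINETUNED_SIZES, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))

print(nearest_finetuned_size(1920, 1080))  # -> (1024, 576), i.e. 16:9
```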

Uses

Training Details

Training Data

The training data is a subset of https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus (see Preprocessing below). Aspect-ratio distribution of the training images (top 20 ratios, by count):

| aspect ratio | count |
| --- | --- |
| 1:1 | 154727 |
| 3:2 | 119615 |
| 2:3 | 61197 |
| 4:3 | 52276 |
| 16:9 | 38862 |
| 400:267 | 21893 |
| 3:4 | 16893 |
| 8:5 | 16258 |
| 4:5 | 15684 |
| 6:5 | 12228 |
| 1000:667 | 12097 |
| 2:1 | 11006 |
| 800:533 | 10259 |
| 5:4 | 9753 |
| 500:333 | 9700 |
| 250:167 | 9114 |
| 5:3 | 8460 |
| 200:133 | 7832 |
| 1024:683 | 7176 |
| 11:10 | 6470 |
The finetuned resolutions these aspect ratios map to are listed under Model Description above.
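
For reference, a distribution like the one above can be recomputed from the WIDTH and HEIGHT columns that img2dataset saves via --save_additional_columns (see Preprocessing below). This is only an illustrative sketch; the metadata file name is hypothetical and the exact script used for this card is not in the repo:

```python
# Illustrative only: count reduced aspect ratios from image metadata.
# "metadata.parquet" is a hypothetical file with WIDTH/HEIGHT columns.
from math import gcd
import pandas as pd

df = pd.read_parquet("metadata.parquet")

def reduced_ratio(w, h) -> str:
    g = gcd(int(w), int(h))
    return f"{int(w) // g}:{int(h) // g}"

counts = (
    df.dropna(subset=["WIDTH", "HEIGHT"])
      .apply(lambda row: reduced_ratio(row["WIDTH"], row["HEIGHT"]), axis=1)
      .value_counts()
      .head(20)
)
print(counts)
```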

Training Procedure


Preprocessing

  1. Download the files with URL & caption from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
  2. Use img2dataset (https://github.com/rom1504/img2dataset) to convert them to a webdataset.
    • I put train-00000-of-00007-29aec9150af50f9f.parquet in a folder called first-file
    • the output folder is /mnt/aesthetics6plus, change this to your own folder

INPUT_FOLDER=first-file
OUTPUT_FOLDER=/mnt/aesthetics6plus
img2dataset --url_list $INPUT_FOLDER --input_format "parquet" \
        --url_col "URL" --caption_col "TEXT" --output_format webdataset \
        --output_folder $OUTPUT_FOLDER --processes_count 3 --thread_count 6 \
        --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
        --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True

  3. The data-loading code does the rest of the preprocessing on the fly, so nothing else needs to be done here. Note that it is not optimized for speed (GPU utilization fluctuates between 80% and 100%) and is not written for multi-GPU training, so use it with caution. A rough sketch of what this on-the-fly preprocessing does is shown below.
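
The sketch below illustrates what such on-the-fly preprocessing could look like: bucket each image to the nearest finetuned aspect ratio, resize, and center-crop. It is an assumption-laden illustration, not the repo's actual data loader, and it reuses the hypothetical FINETUNED_SIZES list and nearest_finetuned_size helper from the sketch under Model Description:

```python
from PIL import Image

def preprocess_example(img: Image.Image) -> Image.Image:
    """Sketch (NOT the repo's actual loader): resize an image so it covers its
    nearest finetuned resolution bucket, then center-crop to that bucket size."""
    # nearest_finetuned_size is the hypothetical helper defined earlier
    bw, bh = nearest_finetuned_size(*img.size)
    scale = max(bw / img.width, bh / img.height)  # scale so the bucket fits inside
    img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    left, top = (img.width - bw) // 2, (img.height - bh) // 2
    return img.crop((left, top, left + bw, top + bh))
```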

Speeds, Sizes, Times

Results

More information needed

Model Card Authors

Jonathan Chang

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel

def use_DPM_solver(pipe):
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead
# pipe = StableDiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-base",
#     unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#     torch_dtype=torch.float16,
# )
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
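
To generate at one of the finetuned non-square resolutions, pass width and height to the pipeline call (continuing from the code above; the output filename is arbitrary):

```python
# 16:9 is one of the finetuned aspect ratios (1024x576)
image = pipe(prompt, width=1024, height=576).images[0]
image.save("astronaut_rides_horse_16x9.png")
```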