BK-SDM-2M Model Card

BK-SDM-{Base-2M, Small-2M, Tiny-2M} are pretrained on 10× more data (2.3M LAION image-text pairs) than our previous release.

Examples with the 🤗 Diffusers library

Inference code with the default PNDM scheduler and 50 denoising steps is as follows.

import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained("nota-ai/bk-sdm-tiny-2m", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate one image with the pipeline defaults and save it.
prompt = "a black vase holding a bouquet of roses"
image = pipe(prompt).images[0]

image.save("example.png")
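
The step count and random seed can also be set explicitly rather than left at the pipeline defaults. The following is a minimal sketch, not part of the original release: the `generate` helper and its parameter names are hypothetical, while `num_inference_steps` and `generator` are standard arguments of the Diffusers pipeline call.

```python
def generate(prompt, steps=50, seed=0, model_id="nota-ai/bk-sdm-tiny-2m", device="cuda"):
    """Generate one image with an explicit step count and a fixed seed.

    Hypothetical helper for illustration; only the model ID comes from this card.
    """
    # Imported lazily so that defining the helper does not require a GPU setup.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to(device)

    # A seeded generator keeps the initial noise fixed across runs on the same setup.
    generator = torch.Generator(device=device).manual_seed(seed)

    result = pipe(prompt, num_inference_steps=steps, generator=generator)
    return result.images[0]
```

For instance, `generate("a black vase holding a bouquet of roses", steps=25)` would use the 25-step setting reported in the evaluation section of this card; the returned PIL image can then be saved with `.save(...)`.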

Compression Method

BK-SDM-2M keeps the U-Net architecture and distillation pretraining of BK-SDM; the difference is a 10× increase in the number of training pairs.
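
Because the architectures are unchanged, the U-Net sizes are the ones listed in the results table of this card. As a rough illustration, the sketch below computes the U-Net parameter reduction relative to Stable Diffusion v1.4 from those reported counts. The dictionary and function names are made up for this example, and since the counts are rounded to two decimals, the percentages are approximate.

```python
# U-Net parameter counts in billions, copied from the results table in this card.
UNET_PARAMS_B = {
    "Stable Diffusion v1.4": 0.86,
    "BK-SDM-Base-2M": 0.58,
    "BK-SDM-Small-2M": 0.49,
    "BK-SDM-Tiny-2M": 0.33,
}


def reduction_percent(model, reference="Stable Diffusion v1.4"):
    """Return the U-Net parameter reduction of `model` versus `reference`, in percent."""
    ratio = UNET_PARAMS_B[model] / UNET_PARAMS_B[reference]
    return round(100 * (1 - ratio), 1)


for name in ("BK-SDM-Base-2M", "BK-SDM-Small-2M", "BK-SDM-Tiny-2M"):
    print(f"{name}: {reduction_percent(name)}% fewer U-Net parameters")
```

With the rounded counts, this gives reductions of roughly 33%, 43%, and 62% for the Base, Small, and Tiny variants.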

Experimental Results

The following table shows the zero-shot results on 30K samples from the MS-COCO validation split. We generated 512×512 images with the PNDM scheduler and 25 denoising steps, then downsampled them to 256×256 to compute the generation scores.

| Model | FID↓ | IS↑ | CLIP Score↑<br>(ViT-g/14) | # Params,<br>U-Net | # Params,<br>Whole SDM |
|---|---|---|---|---|---|
| Stable Diffusion v1.4 | 13.05 | 36.76 | 0.2958 | 0.86B | 1.04B |
| BK-SDM-Base (Ours) | 15.76 | 33.79 | 0.2878 | 0.58B | 0.76B |
| BK-SDM-Base-2M (Ours) | 14.81 | 34.17 | 0.2883 | 0.58B | 0.76B |
| BK-SDM-Small (Ours) | 16.98 | 31.68 | 0.2677 | 0.49B | 0.66B |
| BK-SDM-Small-2M (Ours) | 17.05 | 33.10 | 0.2734 | 0.49B | 0.66B |
| BK-SDM-Tiny (Ours) | 17.12 | 30.09 | 0.2653 | 0.33B | 0.50B |
| BK-SDM-Tiny-2M (Ours) | 17.53 | 31.32 | 0.2690 | 0.33B | 0.50B |
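
To make the comparison between each model and its 2M counterpart easier to read, the differences can be computed directly from the table above. This is a small arithmetic sketch; the `SCORES` dictionary and the `deltas` function are names introduced here for illustration, and the numbers are copied from the table.

```python
# (FID, IS, CLIP score) per model, copied from the table above.
SCORES = {
    "BK-SDM-Base": (15.76, 33.79, 0.2878),
    "BK-SDM-Base-2M": (14.81, 34.17, 0.2883),
    "BK-SDM-Small": (16.98, 31.68, 0.2677),
    "BK-SDM-Small-2M": (17.05, 33.10, 0.2734),
    "BK-SDM-Tiny": (17.12, 30.09, 0.2653),
    "BK-SDM-Tiny-2M": (17.53, 31.32, 0.2690),
}


def deltas(name):
    """Return (FID, IS, CLIP) differences: 2M variant minus the earlier model."""
    old = SCORES[name]
    new = SCORES[name + "-2M"]
    # Round to four decimals to hide floating-point noise in the subtraction.
    return tuple(round(n - o, 4) for o, n in zip(old, new))


for name in ("BK-SDM-Base", "BK-SDM-Small", "BK-SDM-Tiny"):
    d_fid, d_is, d_clip = deltas(name)
    print(f"{name} -> 2M: FID {d_fid:+.2f}, IS {d_is:+.2f}, CLIP {d_clip:+.4f}")
```

Read this way, IS and CLIP score increase for all three sizes, while FID (lower is better) improves for Base (−0.95) and rises slightly for Small (+0.07) and Tiny (+0.41).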

Effect of Different Data Sizes for Training BK-SDM-Small

Increasing the number of training pairs improves the IS and CLIP scores as training progresses. The MS-COCO 256×256 30K benchmark was used for evaluation.

<center> <img alt="Training progress with different data sizes" src="https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/assets-bk-sdm/fig_iter_data_size.png" width="100%"> </center>

Furthermore, as the data volume grows, the visual results improve (e.g., better image-text alignment and clearer distinction among objects).

<center> <img alt="Visual results with different data sizes" src="https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/assets-bk-sdm/fig_results_data_size.png" width="100%"> </center>

Additional Visual Examples

<center> <img alt="Additional visual examples" src="https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/assets-bk-sdm/fig_results_models_2m.png" width="100%"> </center>

Uses

Follow the usage guidelines of Stable Diffusion v1.

Acknowledgments

Citation

@article{kim2023architectural,
  title={On Architectural Compression of Text-to-Image Diffusion Models},
  author={Kim, Bo-Kyeong and Song, Hyoung-Kyu and Castells, Thibault and Choi, Shinkook},
  journal={arXiv preprint arXiv:2305.15798},
  year={2023},
  url={https://arxiv.org/abs/2305.15798}
}
@article{Kim_2023_ICMLW,
  title={BK-SDM: Architecturally Compressed Stable Diffusion for Efficient Text-to-Image Generation},
  author={Kim, Bo-Kyeong and Song, Hyoung-Kyu and Castells, Thibault and Choi, Shinkook},
  journal={ICML Workshop on Efficient Systems for Foundation Models (ES-FoMo)},
  year={2023},
  url={https://openreview.net/forum?id=bOVydU0XKC}
}

This model card was written by Bo-Kyeong Kim and is based on the Stable Diffusion v1 model card.