stable-diffusion stable-diffusion-diffusers text-to-image en english

English Stable Diffusion Pokemon Model Card


Stable-Diffusion-Pokemon-en is an English-specific latent text-to-image diffusion model capable of generating Pokemon images from any text input.

This model was trained using the diffusers library. For more information about our training method, see train_text_to_image.py.


Model Details

Examples

First, install 🤗's Diffusers library, which is used to run English Stable Diffusion:

pip install diffusers==0.4.1

Run this command to log in with your HF Hub token if you haven't before:

huggingface-cli login

Running the pipeline with the LMSDiscreteScheduler scheduler:

from diffusers import LMSDiscreteScheduler, StableDiffusionPipeline

# Configure the LMS scheduler used for sampling.
scheduler = LMSDiscreteScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    num_train_timesteps=1000,
)

# Alternatively, point this at a local checkpoint directory such as "en_model_26000".
pretrained_model_name_or_path = "svjack/Stable-Diffusion-Pokemon-en"
pipe = StableDiffusionPipeline.from_pretrained(
    pretrained_model_name_or_path,
    scheduler=scheduler,
    use_auth_token=True,  # uses the token stored by `huggingface-cli login`
)

# Move the pipeline to the GPU.
pipe = pipe.to("cuda")

# Disable the safety checker by replacing it with a pass-through.
pipe.safety_checker = lambda images, clip_input: (images, False)

imgs = pipe(
    "A cartoon character with a potted plant on his head",
    num_inference_steps=100,
)
image = imgs.images[0]

image.save("output.png")
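
The `beta_schedule="scaled_linear"` setting passed to the scheduler above controls how the noise variances (betas) are spaced across training timesteps. The following is a minimal sketch, assuming the schedule is built the way diffusers builds it: the square roots of the betas are spaced evenly between `sqrt(beta_start)` and `sqrt(beta_end)`, then squared. The helper name `scaled_linear_betas` is hypothetical and only for illustration; it is not part of the diffusers API.

```python
import math


def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_train_timesteps=1000):
    """Hypothetical helper: betas whose square roots are evenly spaced."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    step = (hi - lo) / (num_train_timesteps - 1)
    # Interpolate linearly in sqrt-space, then square back to get each beta.
    return [(lo + i * step) ** 2 for i in range(num_train_timesteps)]


betas = scaled_linear_betas()

# Cumulative product of (1 - beta): the squared scale of the clean signal
# that remains at each training timestep.
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)

print(f"first beta: {betas[0]:.5f}, last beta: {betas[-1]:.5f}")
print(f"alphas_cumprod at final step: {alphas_cumprod[-1]:.4f}")
```

Compared with a plain linear schedule over the same range, spacing the square roots evenly keeps the early betas smaller, so noise is added more gradually in the first timesteps.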

Generated Results Comparison

https://github.com/svjack/Stable-Diffusion-Pokemon


<!-- <table><caption>Images</caption> <thead> <tr> <th>Prompt</th> <th colspan="1">English</th> </tr> </thead> <tbody> <tr> <td>A cartoon character with a potted plant on his head<br/><br/>鉢植えの植物を頭に載せた漫画のキャラクター<br/><br/>一个头上戴着盆栽的卡通人物</td> <td><img src="https://github.com/svjack/Stable-Diffusion-Pokemon/blob/main/imgs/en_bird.jpg" alt="Girl in a jacket" width="500" height="500"></td> </tr> <tr> <td>cartoon bird<br/><br/>漫画の鳥<br/><br/>卡通鸟</td> <td><img src="en_bird.jpg" alt="Girl in a jacket" width="500" height="500"></td> </tr> </tbody> <tfoot> <tr> <td>blue dragon illustration<br/><br/>ブルードラゴンのイラスト<br/><br/>蓝色的龙图</td> <td><img src="en_blue_dragon.jpg" alt="Girl in a jacket" width="500" height="500"></td> </tr> </tfoot> </table> -->

<!-- Note: JapaneseStableDiffusionPipeline is almost same as diffusers' StableDiffusionPipeline but added some lines to initialize our models properly.

Misuse, Malicious Use, and Out-of-Scope Use

Note: This section is taken from the DALLE-MINI model card, but applies in the same way to Stable Diffusion v1.

The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.

Out-of-Scope Use

The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.

Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:

Limitations and Bias

Limitations

Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Japanese Stable Diffusion was trained on Japanese datasets including LAION-5B with Japanese captions, which consists of images that are primarily limited to Japanese descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model. Further, the ability of the model to generate content with non-Japanese prompts is significantly worse than with Japanese-language prompts.

Safety Module

The intended use of this model is with the Safety Checker in Diffusers. This checker works by checking model outputs against known hard-coded NSFW concepts. The concepts are intentionally hidden to reduce the likelihood of reverse-engineering this filter. Specifically, the checker compares the class probability of harmful concepts in the embedding space of the CLIPTextModel after generation of the images. The concepts are passed into the model with the generated image and compared to a hand-engineered weight for each NSFW concept.

Training

Training Data We used the following dataset for training the model:

Training Procedure Japanese Stable Diffusion has the same architecture as Stable Diffusion and was trained by using Stable Diffusion. Because Stable Diffusion was trained on an English dataset and the CLIP tokenizer is designed mainly for English, we used two stages to transfer to a language-specific model, inspired by PITI.

  1. Train a Japanese-specific text encoder with our Japanese tokenizer from scratch with the latent diffusion model fixed. This stage is expected to map Japanese captions to Stable Diffusion's latent space.
  2. Fine-tune the text encoder and the latent diffusion model jointly. This stage is expected to generate Japanese-style images more.

-->