Tags: text-to-video, text-to-image, jax-diffusers-event, art

Make-A-Video SD JAX Model Card

A latent diffusion model for text-to-video synthesis.

Try it with an interactive demo on Hugging Face Spaces.

Training code and both PyTorch and Flax implementations are available here: https://github.com/lopho/makeavid-sd-tpu

This model extends an inpainting latent diffusion image generation model (Stable Diffusion v1.5 Inpaint) with temporal convolutions and temporal self-attention ported from Make-A-Video PyTorch.
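The key idea behind the temporal self-attention layers is that each spatial position in the latent feature map attends over its own sequence of frames, so motion can be modeled without disturbing the pretrained spatial layers. A minimal single-head sketch in JAX is shown below; the function name, shapes, and weight handling are illustrative assumptions, not the repository's actual API.

```python
import jax
import jax.numpy as jnp

def temporal_self_attention(x, wq, wk, wv):
    # x: (batch, frames, spatial_positions, channels) latent features.
    # Hypothetical single-head sketch: each spatial position attends
    # over its own temporal sequence, leaving spatial layers untouched.
    b, f, s, c = x.shape
    # Fold spatial positions into the batch axis so attention runs
    # along the frame axis only.
    x = x.transpose(0, 2, 1, 3).reshape(b * s, f, c)
    q = x @ wq
    k = x @ wk
    v = x @ wv
    # Scaled dot-product attention over frames.
    attn = jax.nn.softmax(q @ k.transpose(0, 2, 1) / jnp.sqrt(c), axis=-1)
    out = attn @ v
    # Restore the original (batch, frames, spatial, channels) layout.
    return out.reshape(b, s, f, c).transpose(0, 2, 1, 3)

key = jax.random.PRNGKey(0)
# 2 videos, 8 frames, 16 spatial positions, 32 channels (toy sizes).
x = jax.random.normal(key, (2, 8, 16, 32))
wq = wk = wv = jnp.eye(32)  # identity projections for illustration
y = temporal_self_attention(x, wq, wk, wv)
print(y.shape)  # (2, 8, 16, 32)
```

Because the attention is purely temporal, the layer's output has the same shape as its input and can be inserted after each spatial block of the UNet; initializing its output projection near zero lets fine-tuning start from the image model's behavior.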

It was then fine-tuned for ~150k steps on a dataset of 10,000 dance-themed videos, followed by an additional ~50k steps with generic videos mixed into the original set.

The model was initialized from weights pretrained by lxj616 on 286 timelapse video clips.

Table of Contents

Model Details

Uses

Limitations

Training

Training Data

Training Procedure

Hyperparameters

Training statistics are available at Weights & Biases.

Acknowledgements

Citation

@misc{TempoFunk2023,
      author = {Lopho, Carlos Chavez},
      title = {TempoFunk: Extending latent diffusion image models to Video},
      url = {https://github.com/lopho/makeavid-sd-tpu},
      month = {5},
      year = {2023}
}

This model card was written by Lopho, Chavinlo, and Julian Herrera, and is based on the DALL-E Mini model card.