Tags: text-to-video, text-to-image, jax-diffusers-event, art

Make-A-Video SD JAX Model Card

A latent diffusion model for text-to-video synthesis.

Try it with an interactive demo on Hugging Face Spaces.

Training code and both PyTorch and Flax implementations are available here: https://github.com/lopho/makeavid-sd-tpu

This model extends an inpainting latent diffusion image generation model (Stable Diffusion v1.5 Inpaint) with temporal convolutions and temporal self-attention ported from Make-A-Video PyTorch.
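The key idea behind the temporal self-attention layers is that each spatial position in the latent feature map attends over its own sequence of frames, so motion can be modeled without disturbing the pretrained spatial layers. A minimal single-head sketch in JAX is shown below; the function name, shapes, and weight handling are illustrative assumptions, not the repository's actual API.

```python
import jax
import jax.numpy as jnp

def temporal_self_attention(x, wq, wk, wv):
    # x: (batch, frames, spatial_positions, channels) latent features.
    # Hypothetical single-head sketch: each spatial position attends
    # over its own temporal sequence, leaving spatial layers untouched.
    b, f, s, c = x.shape
    # Fold spatial positions into the batch axis so attention runs
    # along the frame axis only.
    x = x.transpose(0, 2, 1, 3).reshape(b * s, f, c)
    q = x @ wq
    k = x @ wk
    v = x @ wv
    # Scaled dot-product attention over frames.
    attn = jax.nn.softmax(q @ k.transpose(0, 2, 1) / jnp.sqrt(c), axis=-1)
    out = attn @ v
    # Restore the original (batch, frames, spatial, channels) layout.
    return out.reshape(b, s, f, c).transpose(0, 2, 1, 3)

key = jax.random.PRNGKey(0)
# 2 videos, 8 frames, 16 spatial positions, 32 channels (toy sizes).
x = jax.random.normal(key, (2, 8, 16, 32))
wq = wk = wv = jnp.eye(32)  # identity projections for illustration
y = temporal_self_attention(x, wq, wk, wv)
print(y.shape)  # (2, 8, 16, 32)
```

Because the attention is purely temporal, the layer's output has the same shape as its input and can be inserted after each spatial block of the UNet; initializing its output projection near zero lets fine-tuning start from the image model's behavior.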

It was then fine-tuned for ~150k steps on a dataset of 10,000 dance-themed videos, followed by an additional ~50k steps with generic videos mixed into the original set.

The model was initialized from weights pretrained by lxj616 on 286 timelapse video clips.

Table of Contents

Model Details

Uses

Limitations

Training

Training Data

Training Procedure

Hyperparameters

Training statistics are available at Weights & Biases.

Acknowledgements

Citation

@misc{TempoFunk2023,
      author = {Lopho, Carlos Chavez},
      title = {TempoFunk: Extending latent diffusion image models to Video},
      url = {https://github.com/lopho/makeavid-sd-tpu},
      month = {5},
      year = {2023}
}

This model card was written by Lopho, Chavinlo, and Julian Herrera, and is based on the DALL-E Mini model card.