
Stable-Dreamfusion

A PyTorch implementation of the text-to-3D model DreamFusion, powered by the Stable Diffusion text-to-2D model.

The original paper's project page: DreamFusion: Text-to-3D using 2D Diffusion.

Colab notebook for usage: Open In Colab

Example generated from the text prompt "a high quality photo of a pineapple", viewed with the GUI in real time:

https://user-images.githubusercontent.com/25863658/194241493-f3e68f78-aefe-479e-a4a8-001424a61b37.mp4

Gallery | Update Logs

Important Notice

This project is a work in progress and differs from the paper in many ways. Many features are not yet implemented. The current generation quality does not match the results of the original paper, and many prompts still fail badly!

Notable differences from the paper

TODOs

Install

git clone https://github.com/ashawkey/stable-dreamfusion.git
cd stable-dreamfusion

Important: To download the Stable Diffusion model checkpoint, you need to provide your access token. You can do so in either of the following ways:

Install with pip

pip install -r requirements.txt

# (optional) install nvdiffrast for exporting textured mesh (--save_mesh)
pip install git+https://github.com/NVlabs/nvdiffrast/

# (optional) install the tcnn backbone if using --tcnn
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

# (optional) install CLIP guidance for the dreamfield setting
pip install git+https://github.com/openai/CLIP.git

Build extension (optional)

By default, we use load (from torch.utils.cpp_extension) to build the extensions at runtime. We also provide a setup.py to build each extension ahead of time:

# install all extension modules
bash scripts/install_ext.sh

# if you want to install manually, here is an example:
pip install ./raymarching # install to python path (you still need the raymarching/ folder, since this only installs the built extension.)

Tested environments

Usage

The first run will take some time, because the CUDA extensions need to be compiled.

### stable-dreamfusion setting
## train with text prompt (with the default settings)
# `-O` equals `--cuda_ray --fp16 --dir_text`
# `--cuda_ray` enables instant-ngp-like occupancy grid based acceleration.
# `--fp16` enables half-precision training.
# `--dir_text` enables view-dependent prompting.
python main.py --text "a hamburger" --workspace trial -O

# if the above command fails to generate anything (i.e., it learns an empty scene), try:
# 1. disable random lambertian shading and simply use albedo as color:
python main.py --text "a hamburger" --workspace trial -O --albedo_iters 10000 # i.e., set --albedo_iters >= --iters, which defaults to 10000
# 2. use a smaller density regularization weight:
python main.py --text "a hamburger" --workspace trial -O --lambda_entropy 1e-5

# you can also train in a GUI to visualize the training progress:
python main.py --text "a hamburger" --workspace trial -O --gui

# a Gradio GUI is also available (with fewer options):
python gradio_app.py # open in web browser

## after the training is finished:
# test (exporting 360 video)
python main.py --workspace trial -O --test
# also save a mesh (with obj, mtl, and png texture)
python main.py --workspace trial -O --test --save_mesh
# test with a GUI (free view control!)
python main.py --workspace trial -O --test --gui

### dreamfields (CLIP) setting
python main.py --text "a hamburger" --workspace trial_clip -O --guidance clip
python main.py --text "a hamburger" --workspace trial_clip -O --test --gui --guidance clip
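
The `--dir_text` flag above enables view-dependent prompting, where the text prompt is varied with the camera pose. The following is a minimal sketch of that idea, not the repository's actual code: the function name `directional_prompt`, the angle thresholds, and the exact suffix strings are all assumptions made for illustration.

```python
def directional_prompt(text, azimuth_deg, elevation_deg, overhead_min_deg=60.0):
    """Append a coarse view description to a prompt (hypothetical helper).

    azimuth_deg:   camera angle around the object, 0 = front (assumed convention).
    elevation_deg: camera angle above the horizon, 90 = straight above.
    """
    # A high camera is described as overhead, regardless of azimuth.
    if elevation_deg >= overhead_min_deg:
        return f"{text}, overhead view"

    az = azimuth_deg % 360.0  # wrap into [0, 360)
    if az < 45.0 or az >= 315.0:
        view = "front"
    elif 135.0 <= az < 225.0:
        view = "back"
    else:
        view = "side"
    return f"{text}, {view} view"


if __name__ == "__main__":
    for az in (0, 90, 180, 270):
        print(directional_prompt("a hamburger", az, 10))
```

Each sampled camera pose would then use the matching prompt when querying the guidance model, which may help discourage the same "front" appearance from being painted on every side of the object.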

Code organization & Advanced tips

This is a simple description of the most important implementation details. If you are interested in improving this repo, this might be a starting point. Any contribution would be greatly appreciated!

# 1. we need to interpolate the NeRF rendering to 512x512, to feed it to SD's VAE.
pred_rgb_512 = F.interpolate(pred_rgb, (512, 512), mode='bilinear', align_corners=False)
# 2. image (512x512) --- VAE --> latents (64x64); this is where SD differs from Imagen.
latents = self.encode_imgs(pred_rgb_512)
... # timestep sampling, noise adding and UNet noise predicting
# 3. the SDS loss: since backpropagation through the UNet is skipped, we cannot simply use autodiff, so we manually set the grad for latents.
w = self.alphas[t] ** 0.5 * (1 - self.alphas[t])
grad = w * (noise_pred - noise)
latents.backward(gradient=grad, retain_graph=True)
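
The excerpt above depends on the surrounding class, so here is a self-contained sketch of the same pattern that can be run in isolation. Everything standing in for the real models is an assumption: `encoder` is a single convolution in place of SD's VAE, `alphas` is a toy schedule, and `noise_pred` is a fake prediction in place of the UNet output. Only the interpolation, the weighting formula, and the manual `backward` call mirror the excerpt.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the NeRF rendering: a learnable 64x64 RGB image (hypothetical).
pred_rgb = torch.rand(1, 3, 64, 64, requires_grad=True)

# Stand-in for SD's VAE encoder (hypothetical): 512x512 image -> 4x64x64 latents.
encoder = torch.nn.Conv2d(3, 4, kernel_size=8, stride=8)

# Toy noise schedule (assumed values, not SD's actual schedule).
alphas = torch.linspace(0.999, 0.01, 1000)

# 1. upsample the rendering, 2. encode it to latents (both stay in the autograd graph).
pred_rgb_512 = F.interpolate(pred_rgb, (512, 512), mode='bilinear', align_corners=False)
latents = encoder(pred_rgb_512)

# Sample a timestep and noise; the "UNet" step runs without gradient tracking.
t = torch.randint(20, 980, (1,))
noise = torch.randn_like(latents)
with torch.no_grad():
    noisy = alphas[t].sqrt() * latents + (1 - alphas[t]).sqrt() * noise
    noise_pred = noise + 0.1 * noisy  # fake noise prediction (hypothetical)

# 3. build the SDS gradient by hand and push it back through the encoder.
w = alphas[t] ** 0.5 * (1 - alphas[t])
grad = w * (noise_pred - noise)
latents.backward(gradient=grad)

print(pred_rgb.grad.shape)
```

Passing `gradient=grad` tells autograd to treat `grad` as the derivative of some unspecified loss with respect to `latents`, so the rendering receives gradients even though no scalar loss was ever formed. Calling `backward` on the surrogate scalar `(grad * latents).sum()` instead would yield the same gradient, since `grad` is computed under `no_grad` and treated as a constant.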

Acknowledgement