
I2VGen-XL: A Large Model for High-Definition Image-to-Video Generation

The I2VGen-XL project addresses the task of generating high-definition video from an input image. I2VGen-XL is one of the high-definition video generation foundation models developed by DAMO Academy. Its core consists of two stages, which address semantic consistency and clarity respectively, with approximately 3.7 billion parameters in total. The model was pre-trained on a large-scale mixture of video and image data and then fine-tuned on a small amount of high-quality data. This data is widely distributed and covers diverse categories, and the model generalizes well to different types of data. Compared with existing video generation models, I2VGen-XL shows clear advantages in clarity, texture, semantics, and temporal continuity.

In addition, many design concepts and details of I2VGen-XL (such as the core UNet) are inherited from our publicly available work, VideoComposer. For details, please refer to VideoComposer and the GitHub code repository for this ModelScope project. <center> <p align="center"> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/image/Fig_twostage.png"/><br/> Fig.1 I2VGen-XL </p> </center>

<font color="#dd0000">Demo (project experience address):</font> <font color="#0000ff">https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary</font>

Introduction

As shown in Fig.2, I2VGen-XL is a video latent diffusion model (VLDM). It uses our specially designed spatio-temporal UNet (ST-UNet) to perform spatio-temporal modeling in the latent space and reconstructs the final video through a decoder (for details of the model structure, please refer to VideoComposer). To generate 720P videos, we divide I2VGen-XL into two stages. The first stage ensures semantic consistency at low resolution, while the second stage uses a new VLDM to denoise the result, raising the video resolution and improving temporal and spatial consistency at the same time. Through joint optimization of the model, data, and training, I2VGen-XL has the following characteristics.

<center> <p align="center"> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/image/fig1_overview.jpg"/> <br/> Fig.2 VLDM </p> </center>

Below are some examples generated by the model:

For display purposes, the examples on this page are shown as low-resolution GIFs, and the GIF format reduces video quality. To see the 720P results, follow the corresponding HQ Video link below each example.

<center> <table><center> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/dragon2_rank_02-00-0021-001024.gif"/> </center></td> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/laoshu_rank_02-01-0810-001024.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424319402790.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423628044217.mp4">HQ Video</a> </center></td> </tr> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/ac10af0b1c524b778aff60be5b7ecc4f_2_02_00_0065_rank_02-00-1256-001024.gif"/> </center></td> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/ast_rank_02-00-0773-001024.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423965629168.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423969933887.mp4">HQ Video</a> </center></td> </tr> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/e3733444344741f1970cf2e92e617182_1_02_00_0199.gif"/> </center></td> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/b307dad96c3d440e80514b1b3f3be5fd_1_rank_02-00-0068-000000.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423966661082.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424613631285.mp4">HQ Video</a> </center></td> </tr> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/robot1_rank_02-01-0009-009999.gif"/> </center></td> <td ><center> <img 
src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/d82ed4ad01034243ba88eaf9311c1edf_3_02_01_0193.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424612211915.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424613123188.mp4">HQ Video</a> </center></td> </tr> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/airship_0_rank_02-00-000000_rank_02-00-0653-001024.gif"/> </center></td> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/airship_1_rank_02-01-000000_rank_02-00-1428-001024.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424616459162.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424614735831.mp4">HQ Video</a> </center></td> </tr> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/0ba38f2f287f446dac8de87291073e0c_3_rank_02-01-0118-000000.gif"/> </center></td> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/03b401c825a2479eaf7b1b3252683a4b_3_02_00_0110_rank_02-00-1009-001024.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424617591002.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423631572030.mp4">HQ Video</a> </center></td> </tr> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/3e89356e6bd3470aaf3900b1b34c3ec2_0_rank_02-01-0126-000000.gif"/> </center></td> <td ><center> <img 
src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/6fd21439fce644afa3a2e9b057956d0f_0000000_rank_02-01-0159-001024.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423629092176.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424616071017.mp4">HQ Video</a> </center></td> </tr> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/293fdf76aa404971b1fbb66baf9cbaac_1_02_00_0123_rank_02-00-0288-001024.gif"/> </center></td> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/426a7bee22034a88872dc8277ddbbf06_0_02_01_0023_rank_02-01-1090-001024.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424317682762.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424313138794.mp4">HQ Video</a> </center></td> </tr> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/a15bb09862b74b3c983a54b379912f81_0_02_00_0055_rank_02-01-0443-001024.gif"/> </center></td> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/7716d91802614bf9a99174c05bd08f32_3_02_01_0157_rank_02-01-1199-001024.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423631376023.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424616459198.mp4">HQ Video</a> </center></td> </tr> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/indian_rank_02-00-0800-001024.gif"/> </center></td> <td ><center> <img 
src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/bike_rank_02-01-0007-001024.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424314646086.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424610479196.mp4">HQ Video</a> </center></td> </tr> <tr> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/panda_rank_02-01-0007-009999.gif"/> </center></td> <td ><center> <img src="https://huggingface.co/damo-vilab/MS-Image2Video/resolve/main/assets/gif/bf19a66dca0a47799923c47249982ffd_0000000_rank_02-01-0960-001024.gif"/> </center></td> </tr> <tr> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424321438157.mp4">HQ Video</a> </center></td> <td ><center> <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424614283086.mp4">HQ Video</a> </center></td> </tr> </table> </center>

[<font color="#dd0000">2023.08.25 update</font>] ModelScope 1.8.4 has been released, and the I2VGen-XL model has been updated to model parameter file v1.1.0.

Dependency

First, make sure that the ffmpeg command is installed on your system. If it is not, you can install it with the following command:

sudo apt-get update && sudo apt-get install -y ffmpeg libsm6 libxext6
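Before running the pipelines, it can help to confirm from Python that ffmpeg is actually reachable. The following is a minimal sketch using only the standard library; the helper names `find_ffmpeg` and `ffmpeg_version_line` are hypothetical and not part of ModelScope.

```python
import shutil
import subprocess


def find_ffmpeg():
    """Return the path of the ffmpeg executable on PATH, or None if absent."""
    return shutil.which("ffmpeg")


def ffmpeg_version_line(path):
    """Return the first line printed by `ffmpeg -version`."""
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    ffmpeg_path = find_ffmpeg()
    if ffmpeg_path is None:
        print("ffmpeg not found; install it with the command above.")
    else:
        print(ffmpeg_version_line(ffmpeg_path))
```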

Second, the I2VGen-XL project is adapted to the ModelScope codebase. The following are some of the dependencies that need to be installed for this project.

pip install modelscope==1.8.4
pip install xformers==0.0.20
pip install torch==2.0.1
pip install "open_clip_torch>=2.0.2"
pip install opencv-python-headless
pip install opencv-python 
pip install "einops>=0.4"
pip install rotary-embedding-torch
pip install fairscale 
pip install scipy
pip install imageio
pip install pytorch-lightning
pip install torchsde
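Because several of the packages above are pinned to exact versions, a quick check of what is actually installed can save debugging time. The sketch below uses the standard-library `importlib.metadata`; the `PINNED` table, `check_pins`, and `mismatches` are hypothetical names, and the comparison deliberately ignores local build tags such as `+cu118`.

```python
from importlib import metadata

# Pinned versions copied from the install commands above.
PINNED = {
    "modelscope": "1.8.4",
    "xformers": "0.0.20",
    "torch": "2.0.1",
}


def check_pins(pinned):
    """Map each package name to (expected, installed); installed is None if missing."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


def mismatches(report):
    """Return the sorted names whose installed version differs from the pin."""
    return sorted(
        name
        for name, (expected, installed) in report.items()
        # Ignore local build tags such as "+cu118" when comparing.
        if installed is None or installed.split("+")[0] != expected
    )


if __name__ == "__main__":
    for package in mismatches(check_pins(PINNED)):
        print(f"version mismatch or missing package: {package}")
```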

Inference

For more experiments, please stay tuned for our upcoming technical report and open-source code release.

Code example

from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

pipe = pipeline(task='image-to-video', model='damo/Image-to-Video', model_revision='v1.1.0')

# IMG_PATH: your image path (URL or local file); 'test.jpg' is a placeholder
IMG_PATH = 'test.jpg'
output_video_path = pipe(IMG_PATH, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
print(output_video_path)
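Since the pipeline input may be either a URL or a local file, a small pre-check can catch a mistyped path before the model is loaded. This is only a sketch: `classify_image_source` is a hypothetical helper, not a ModelScope function, and it assumes that URLs use the `http` or `https` scheme.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_image_source(source):
    """Return 'url' for http(s) links, 'file' for existing local files, else raise."""
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    if Path(source).is_file():
        return "file"
    raise FileNotFoundError(f"image not found: {source}")
```

Calling `classify_image_source(IMG_PATH)` before `pipe(...)` fails fast on a missing local image.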

To generate a super-resolved (high-resolution) video, use the following code:

from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# With a single GPU, make sure it has more than 50 GB of memory; with two GPUs, assign one to each pipeline via `device`
pipe1 = pipeline(task='image-to-video', model='damo/Image-to-Video', model_revision='v1.1.0', device='cuda:0')
pipe2 = pipeline(task='video-to-video', model='damo/Video-to-Video', model_revision='v1.1.0', device='cuda:0')

# image to video
output_video_path = pipe1("test.jpg", output_video='./i2v_output.mp4')[OutputKeys.OUTPUT_VIDEO]

# video super-resolution
p_input = {'video_path': output_video_path}
new_output_video_path = pipe2(p_input, output_video='./v2v_output.mp4')[OutputKeys.OUTPUT_VIDEO]
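The device choice described in the comment above can be made explicit with a small helper. This is a sketch under assumptions: `pick_devices` and `detect_gpu_count` are hypothetical names, and it assumes the second pipeline accepts `device='cuda:1'` in the same way the example passes `device='cuda:0'`.

```python
def pick_devices(num_gpus):
    """Return (image_to_video_device, video_to_video_device) for the given GPU count."""
    if num_gpus < 1:
        raise ValueError("at least one CUDA GPU is required")
    if num_gpus == 1:
        # Both stages share one card, which then needs the memory noted above.
        return "cuda:0", "cuda:0"
    # With two or more cards, give each stage its own GPU.
    return "cuda:0", "cuda:1"


def detect_gpu_count():
    """Count visible CUDA devices; returns 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()
```

The two returned strings can then be passed as the `device` arguments of `pipe1` and `pipe2`, for example `dev1, dev2 = pick_devices(detect_gpu_count())`.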

For more details on super-resolution, please visit <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">Video-to-Video</a>. We also provide a user interface: <a href="https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary">I2VGen-XL-Demo</a>.

Limitation

The I2VGen-XL model still has the following limitations:

In addition, our research found that the spatial quality of the generated video and the speed of its temporal change are, to some extent, mutually exclusive. In this project, we chose a model that strikes a balance between the two.

If you are trying our model, we suggest that you first use the first stage to obtain a video whose semantics match your expectations (when running offline, you can change the Seed in the configuration.json file to generate different videos), and only then run the second-stage video refinement, since that process is more time-consuming. This improves your efficiency and makes it easier to get better results.
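Changing the seed between offline runs can be scripted. The exact key name and its position inside configuration.json are not stated here, so the sketch below makes a loud assumption: it sets every key named `seed` (in any letter case) wherever it appears in the JSON. `set_seed` and `update_config_seed` are hypothetical helpers, not ModelScope APIs.

```python
import json
from pathlib import Path


def set_seed(node, seed):
    """Recursively set every key named 'seed' (any case); return the number of changes."""
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower() == "seed":
                node[key] = seed
                changed += 1
            else:
                changed += set_seed(value, seed)
    elif isinstance(node, list):
        for item in node:
            changed += set_seed(item, seed)
    return changed


def update_config_seed(config_path, seed):
    """Rewrite the JSON file at config_path with the new seed; return the number of changes."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    changed = set_seed(config, seed)
    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return changed
```

A return value of 0 means no such key was found, in which case the file should be inspected by hand rather than trusting this sketch.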

Training Data

Our training data comes from a wide range of sources and has the following attributes:

More powerful and more flexible video generation models will continue to be released, and the accompanying technical report is currently being written. Please stay tuned for updates.

Reference

@article{videocomposer2023,
  title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
  author={Wang, Xiang* and Yuan, Hangjie* and Zhang, Shiwei* and Chen, Dayou* and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
  journal={arXiv preprint arXiv:2306.02018},
  year={2023}
}

@inproceedings{videofusion2023,
  title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
  author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}

License Agreement

Our code and model weights may be used for personal or academic research only. Commercial use is not currently supported.

Contact Us

If you would like to contact our algorithm or product team, or would like to join our algorithm team (internship or full-time), please email us at yingya.zyy@alibaba-inc.com.