
VideoMAE


Paper (arXiv) | Open In Colab | HF Space | HF Hub

Video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent ImageMAE, VideoMAE uses customized video tube masking with an extremely high masking ratio. This simple design makes video reconstruction a more challenging self-supervision task, which encourages the model to extract more effective video representations during pre-training.

This is an unofficial Keras reimplementation of VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. The official PyTorch implementation can be found here.
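For illustration, below is a minimal NumPy sketch of the tube-masking idea described above. The function name and default values are hypothetical and not part of this repository's API.

```python
import numpy as np

def tube_masking(num_frames=16, image_size=224, patch_size=16,
                 tubelet_size=2, mask_ratio=0.9, seed=None):
    """Hypothetical sketch: build a boolean token mask where the same
    randomly chosen spatial patches are masked in every temporal slice
    (a "tube"), at an extremely high masking ratio."""
    rng = np.random.default_rng(seed)
    num_spatial = (image_size // patch_size) ** 2   # 14 x 14 = 196 patches
    num_temporal = num_frames // tubelet_size       # 16 / 2 = 8 token slices
    num_masked = int(mask_ratio * num_spatial)

    spatial_mask = np.zeros(num_spatial, dtype=bool)
    spatial_mask[rng.choice(num_spatial, num_masked, replace=False)] = True

    # Repeating one spatial mask across all time steps is what makes it
    # "tube" masking rather than independent per-frame masking.
    return np.tile(spatial_mask, num_temporal)      # shape: (1568,)

mask = tube_masking(seed=0)
print(mask.shape, mask.mean())   # (1568,) ~0.9
```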

Model Zoo

The pre-trained and fine-tuned models are listed in MODEL_ZOO.md. The following are some highlights.

Kinetics-400

For Kinetics-400, VideoMAE is trained for around 1600 epochs without any extra data. The following checkpoints are available in both TensorFlow SavedModel and H5 formats; a minimal loading sketch follows the table.

| Backbone | #Frame | Top-1 | Top-5 | Params [FT] MB | Params [PT] MB | FLOPs |
| :------- | :----: | :---: | :---: | :------------: | :------------: | :---: |
| ViT-S    | 16x5x3 | 79.0  | 93.8  | 22             | 24             | 57G   |
| ViT-B    | 16x5x3 | 81.5  | 95.1  | 87             | 94             | 181G  |
| ViT-L    | 16x5x3 | 85.2  | 96.8  | 304            | 343            | -     |
| ViT-H    | 16x5x3 | 86.6  | 97.1  | 632            | ?              | -     |

<sup>? The official pre-trained ViT-H checkpoint of VideoMAE has a weight issue; see https://github.com/MCG-NJU/VideoMAE/issues/89 for details.</sup> <sup>FLOPs are reported for the encoder (fine-tuned) models only.</sup>
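As a usage sketch, a fine-tuned H5 checkpoint could be loaded and run roughly as follows. The file name below is hypothetical (take the real one from MODEL_ZOO.md), the input layout is an assumption, and custom layers may require passing `custom_objects`.

```python
import tensorflow as tf
from tensorflow import keras

# Hypothetical file name; use an actual checkpoint from MODEL_ZOO.md.
# A SavedModel directory can be loaded the same way by passing its path.
model = keras.models.load_model("TFVideoMAE_B_K400_FT.h5")

# Assumed input layout: a clip of 16 RGB frames at 224x224, channels last.
clip = tf.random.normal((1, 16, 224, 224, 3))
logits = model(clip, training=False)
print(logits.shape)  # (1, 400): one score per Kinetics-400 class
```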

Something-Something V2

For Something-Something V2 (SSv2), VideoMAE is trained for around 2400 epochs without any extra data.

| Backbone | #Frame | Top-1 | Top-5 | Params [FT] MB | Params [PT] MB | FLOPs |
| :------- | :----: | :---: | :---: | :------------: | :------------: | :---: |
| ViT-S    | 16x2x3 | 66.8  | 90.3  | 22             | 24             | 57G   |
| ViT-B    | 16x2x3 | 70.8  | 92.4  | 86             | 94             | 181G  |

UCF101

For UCF101, VideoMAE is trained for around 3200 epochs without any extra data.

| Backbone | #Frame | Top-1 | Top-5 | Params [FT] MB | Params [PT] MB | FLOPs |
| :------- | :----: | :---: | :---: | :------------: | :------------: | :---: |
| ViT-B    | 16x5x3 | 91.3  | 98.5  | 86             | 94             | 181G  |
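In the tables above, #Frame follows the official repository's frames x temporal clips x spatial crops convention, e.g. 16x5x3 means 16 frames per view, 5 temporal clips and 3 spatial crops (15 views per video), with scores averaged over all views. A minimal sketch of such multi-view evaluation, assuming an already loaded fine-tuned model, might look like the following; the helper name is hypothetical.

```python
import tensorflow as tf

def multi_view_predict(model, views):
    """Average softmax scores over all views of one video.

    views: tensor of shape (num_views, 16, 224, 224, 3), e.g. 5 temporal
    clips x 3 spatial crops = 15 views for the 16x5x3 protocol.
    """
    probs = tf.nn.softmax(model(views, training=False), axis=-1)
    return tf.reduce_mean(probs, axis=0)  # (num_classes,)
```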