vision image-classification

TFaugvit

The TFAugViT model is the TensorFlow implementation of AugViT: Augmented Shortcuts for Vision Transformers by Yehui Tang, Kai Han, Chang Xu, An Xiao, Yiping Deng, Chao Xu, and Yunhe Wang, first released in this repository.

Model description

Aug-ViT adds extra paths with learnable parameters in parallel to the original shortcut (residual) connections in order to alleviate feature collapse. These augmented shortcuts are implemented with block-circulant projections, which add only a negligible amount of computational cost.
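To make the block-circulant idea concrete, here is a minimal NumPy sketch, not the code used in this model. It assumes the projection matrix is split into b x b circulant blocks of size m, each defined by a single vector, and applies them through FFTs instead of building the dense matrix. The function names and the sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def circulant(c):
    """Dense circulant matrix whose first column is the vector c."""
    return np.stack([np.roll(c, k) for k in range(len(c))], axis=1)

def block_circulant_dense(blocks):
    """Assemble the full (d, d) matrix from a (b, b, m) array of defining vectors."""
    b = blocks.shape[0]
    return np.block([[circulant(blocks[i, j]) for j in range(b)] for i in range(b)])

def block_circulant_project(x, blocks):
    """Apply the block-circulant matrix to the last axis of x using FFTs.

    x has shape (..., d) with d = b * m; the dense (d, d) matrix is never built.
    """
    b, _, m = blocks.shape
    xs = x.reshape(*x.shape[:-1], b, m)
    fx = np.fft.fft(xs, axis=-1)
    fc = np.fft.fft(blocks, axis=-1)
    # Output block i sums the circular convolutions over input blocks j.
    fy = np.einsum("ijm,...jm->...im", fc, fx)
    return np.fft.ifft(fy, axis=-1).real.reshape(x.shape)

rng = np.random.default_rng(0)
b, m = 4, 8                             # hypothetical sizes: d = b * m = 32
blocks = rng.normal(size=(b, b, m))     # b * b * m parameters instead of d * d
tokens = rng.normal(size=(5, b * m))    # 5 tokens with 32 features each

fast = block_circulant_project(tokens, blocks)
dense = tokens @ block_circulant_dense(blocks).T
print(np.allclose(fast, dense))
```

In this sketch the projection stores b * b * m values rather than d * d, and the FFT route avoids the dense matrix product, which is one way to read the claim that the extra paths are cheap. In an augmented shortcut, the projected tokens would be added alongside the usual identity shortcut and the block output.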

Intended uses & limitations

This model can be used for image classification and can easily be fine-tuned to suit your use case.

How to use

Here is how to use this model to classify an image into one of the 1,000 ImageNet classes:

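from transformers import TFAutoModelForImageClassification
from PIL import Image
import numpy as np
import requests
import tensorflow as tf

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Assumption: the model expects a float batch of 224x224 RGB images scaled to [0, 1];
# adjust the scaling or channel order if the checkpoint's preprocessing differs.
pixel_values = np.asarray(image.resize((224, 224)), dtype=np.float32)[None] / 255.0

# trust_remote_code=True runs the custom model code shipped with the checkpoint
model = TFAutoModelForImageClassification.from_pretrained("tensorgirl/TFaugvit", trust_remote_code=True)

outputs = model({'pixel_values': pixel_values})

# the model predicts one of the 1000 ImageNet classes
logits = getattr(outputs, "logits", outputs)  # works for an output object or a plain tensor
predicted_class_idx = int(tf.argmax(logits, axis=-1)[0])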
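
A single class index hides how confident the model is. The following sketch, which is independent of the model itself, shows one way to turn a batch of logits into top-k class indices and probabilities with NumPy. The helper name top_k_predictions and the example logits are hypothetical.

```python
import numpy as np

def top_k_predictions(logits, k=5):
    """Return (indices, probabilities) of the k most likely classes per row."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # Sort ascending, reverse to descending, and keep the first k entries.
    idx = np.argsort(probs, axis=-1)[..., ::-1][..., :k]
    return idx, np.take_along_axis(probs, idx, axis=-1)

# Toy logits for one image and four classes (made-up values).
example_logits = np.array([[0.1, 2.0, -1.0, 0.5]])
indices, probabilities = top_k_predictions(example_logits, k=2)
print(indices, probabilities)
```

With real model output, convert the logits to a NumPy array first and map the returned indices to ImageNet label names using whatever label file you have available.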

Training data

The TFAugViT model is trained on ImageNet-1k, a dataset of about 1.28 million training images across 1,000 classes.

Training procedure

Because the model uses the einops library, you cannot call model.fit() on it directly. Instead, either write a custom training loop that passes the inputs as shown above, or wrap the model in a Keras functional model and specify the batch_size beforehand. To train the model on other data, either resize the images to 224x224 or change image_size in the model config to match your images.
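
A custom training loop of the kind described above might look like the following sketch. It is not the procedure used to train this checkpoint. So that it runs on its own, it uses a tiny stand-in Keras model and random data. The names build_stand_in_model, to_inputs, NUM_CLASSES and BATCH_SIZE are hypothetical, and the [0, 1] scaling and channels-last layout are assumptions.

```python
import tensorflow as tf

IMAGE_SIZE = 224   # resolution stated in this card
NUM_CLASSES = 10   # hypothetical number of classes in the new dataset
BATCH_SIZE = 4     # fixed batch size, enforced with drop_remainder below

def build_stand_in_model():
    # Hypothetical stand-in so the sketch is self-contained; in practice,
    # load TFAugViT as in the "How to use" section instead.
    return tf.keras.Sequential([
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(NUM_CLASSES),
    ])

def preprocess(image, label):
    # Resize every image to the resolution the model was configured for.
    image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
    return tf.cast(image, tf.float32) / 255.0, label

def train(model, dataset, to_inputs=lambda x: x, epochs=1, learning_rate=1e-4):
    optimizer = tf.keras.optimizers.Adam(learning_rate)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    losses = []
    for _ in range(epochs):
        for images, labels in dataset:
            with tf.GradientTape() as tape:
                outputs = model(to_inputs(images), training=True)
                # Accept either an output object with .logits or a plain tensor.
                logits = getattr(outputs, "logits", outputs)
                loss = loss_fn(labels, logits)
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            losses.append(float(loss))
    return losses

# Random stand-in data: 8 RGB images of size 32x32 with integer labels.
images = tf.random.uniform((8, 32, 32, 3), maxval=255.0)
labels = tf.random.uniform((8,), maxval=NUM_CLASSES, dtype=tf.int32)
dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .map(preprocess)
           .batch(BATCH_SIZE, drop_remainder=True))

losses = train(build_stand_in_model(), dataset)
print(len(losses))
```

For the actual model, pass to_inputs=lambda x: {'pixel_values': x} so that batches reach the model in the same form as in the inference example.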

Training hyperparameters

The following hyperparameters were used during training:

Evaluation results

| Model | ImageNet top-1 accuracy (%) | # params | Resolution |
| --- | --- | --- | --- |
| Aug-ViT-S | 81 | 22.2 M | 224x224 |
| Aug-ViT-B | 82.4 | 86.5 M | 224x224 |
| Aug-ViT-B (Upsampled) | 84.2 | 86.5 M | 384x384 |

Framework versions

BibTeX entry and citation info

@inproceedings{aug-vit-tf,
title = {AugViT: Augmented Shortcuts for Vision Transformers},
author = {Yehui Tang and Kai Han and Chang Xu and An Xiao and Yiping Deng and Chao Xu and Yunhe Wang},
year = {2021},
url = {https://arxiv.org/abs/2106.15941}
}