
# Compact Convolutional Transformers

Based on the [Compact Convolutional Transformers](https://keras.io/examples/vision/cct/) example on keras.io, created by Sayak Paul.

## Model description

As discussed in the Vision Transformers (ViT) paper, Transformer-based architectures for vision typically require larger datasets and longer pre-training schedules than CNNs. ImageNet-1k (about a million images) is considered to fall under the medium-sized data regime for ViTs. This is primarily because, unlike CNNs, ViTs (and Transformer-based architectures in general) lack well-informed inductive biases, such as convolutions, for processing images. This raises the question: can we combine the benefits of convolutions and the benefits of Transformers in a single network architecture? These benefits include the parameter efficiency of convolutions and the ability of self-attention to model long-range, global dependencies (interactions between different regions of an image).

In [Escaping the Big Data Paradigm with Compact Transformers](https://arxiv.org/abs/2104.05704), Hassani et al. present exactly such an approach: the Compact Convolutional Transformer (CCT) architecture.
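The central idea is to replace ViT's patch-slicing embedding with a small convolutional tokenizer whose output feature map is flattened into the token sequence fed to a standard Transformer encoder. Below is a minimal sketch of such a tokenizer in Keras; the `conv_tokenizer` helper and its layer sizes are illustrative choices, not the exact configuration used to train this model.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def conv_tokenizer(inputs, num_conv_layers=2, filters=(64, 128)):
    """Tokenize an image with small conv blocks instead of patch slicing.

    Illustrative sketch of the CCT idea: convolutions supply the
    inductive bias, and the flattened feature map becomes the token
    sequence consumed by a standard Transformer encoder.
    """
    x = inputs
    for f in filters[:num_conv_layers]:
        x = layers.Conv2D(f, kernel_size=3, padding="same", use_bias=False)(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
    # Flatten the spatial grid into a token sequence: (batch, H*W, channels).
    seq_len = x.shape[1] * x.shape[2]
    return layers.Reshape((seq_len, x.shape[-1]))(x)

inputs = keras.Input(shape=(32, 32, 3))  # CIFAR-10-sized images
tokens = conv_tokenizer(inputs)
print(tokens.shape)  # (None, 64, 128) with the defaults above
```

Because the tokenizer downsamples with convolutions and pooling rather than cutting fixed patches, the resulting sequence is short and the model stays compact enough to train from scratch on small datasets.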

## Intended uses & limitations
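The model classifies 32x32 RGB images into the ten CIFAR-10 classes. A minimal inference sketch, assuming the model is hosted on the Hugging Face Hub; the repo id below is a placeholder, not the actual repository name:

```python
import numpy as np
from huggingface_hub import from_pretrained_keras

# Hypothetical repo id -- replace with this model's actual Hub repository.
model = from_pretrained_keras("user/cct")

# A batch of CIFAR-10-sized inputs: 32x32 RGB images.
images = np.random.rand(1, 32, 32, 3).astype("float32")
probs = model.predict(images)
print(probs.argmax(axis=-1))  # predicted CIFAR-10 class index
```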

## Training and evaluation data

The model is trained using the CIFAR-10 dataset.
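For reference, CIFAR-10 ships with Keras and can be loaded in one line; a minimal sketch:

```python
from tensorflow import keras

# CIFAR-10: 50,000 training and 10,000 test images, 32x32 RGB, 10 classes.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
print(x_train.shape, y_train.shape)  # (50000, 32, 32, 3) (50000, 1)
```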

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

| Hyperparameter            | Value   |
|---------------------------|---------|
| optimizer                 | AdamW   |
| learning_rate             | 0.001   |
| decay                     | 0.0     |
| beta_1                    | 0.9     |
| beta_2                    | 0.999   |
| epsilon                   | 1e-07   |
| amsgrad                   | False   |
| weight_decay              | 1e-04   |
| exclude_from_weight_decay | None    |
| training_precision        | float32 |
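Rounded to readable precision, this configuration corresponds roughly to the following Keras optimizer (a sketch assuming a recent Keras version where `keras.optimizers.AdamW` is available):

```python
from tensorflow import keras

# Values taken from the hyperparameter table above.
optimizer = keras.optimizers.AdamW(
    learning_rate=1e-3,
    weight_decay=1e-4,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    amsgrad=False,
)
```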

### Model Plot

<details> <summary>View Model Plot</summary>

Model Image

</details>

<center> Model reproduced by <a href="https://github.com/EdAbati" target="_blank">Edoardo Abati</a> </center>