vit-base-vocalsound-logmel

This model is a fine-tuned version of google/vit-base-patch16-224 on the VocalSound dataset. It achieves the following results on the evaluation set:

accuracy: 88.8
precision (micro): 91.3
recall (micro): 87.1
f1 score (micro): 89.1
f1 score (macro): 89.1
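
For readers unfamiliar with the two averaging modes, the following is a minimal sketch of how micro and macro F1 differ. It assumes single-label predictions (one predicted class per sample), which may not match how the numbers above were computed; the function name `f1_scores` and the toy labels are illustrative only.

```python
import numpy as np

def f1_scores(y_true, y_pred, num_classes):
    """Return (micro_f1, macro_f1) for single-label class predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    for c in range(num_classes):
        tp[c] = np.sum((y_pred == c) & (y_true == c))
        fp[c] = np.sum((y_pred == c) & (y_true != c))
        fn[c] = np.sum((y_pred != c) & (y_true == c))
    # Micro: pool the counts over all classes, then compute one F1.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Macro: compute F1 per class, then take the unweighted mean.
    denom = 2 * tp + fp + fn
    per_class = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    macro = per_class.mean()
    return float(micro), float(macro)

# Toy example with two classes (hypothetical labels, not VocalSound data).
micro, macro = f1_scores([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 1], num_classes=2)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Micro averaging weights every sample equally, while macro averaging weights every class equally, so the two diverge when classes are imbalanced or when errors concentrate in a few classes.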

Training and evaluation data

Training: VocalSound training split (#samples = 15570)

Evaluation: VocalSound test split (#samples = 3594)

Training hyperparameters

The following hyperparameters were used during training:

optimizer: AdamW
- weight_decay: 0
- learning_rate: 5e-5
batch_size: 32
training_precision: float32

Preprocessing

Unlike vit-base-vocalsound, this model takes log-mel spectrograms as input (the logarithm is applied as an additional step on top of the mel spectrogram), and the preprocessor normalization step uses VocalSound statistics (i.e. mean and standard deviation) instead of the default ImageNet ones.
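
The two preprocessing changes can be sketched roughly as below. This is an illustration under stated assumptions, not the exact pipeline used for this model: it assumes a mel spectrogram has already been computed as a non-negative array, and the helper names (`to_log_mel`, `dataset_stats`, `normalize`), the epsilon value, and the use of a single scalar mean and standard deviation are all hypothetical choices.

```python
import numpy as np

def to_log_mel(mel, eps=1e-6):
    """Apply the logarithm to a non-negative mel spectrogram.

    eps is an assumed small constant that avoids log(0).
    """
    mel = np.asarray(mel, dtype=np.float32)
    return np.log(mel + eps)

def dataset_stats(log_mels):
    """Compute one scalar mean and std over all training spectrograms."""
    values = np.concatenate([m.ravel() for m in log_mels]).astype(np.float64)
    return float(values.mean()), float(values.std())

def normalize(log_mel, mean, std):
    """Normalize with dataset statistics instead of ImageNet defaults."""
    return (log_mel - mean) / std

# Usage with random stand-in data (not real VocalSound audio).
rng = np.random.default_rng(0)
mels = [rng.uniform(0.1, 10.0, size=(64, 100)) for _ in range(3)]
log_mels = [to_log_mel(m) for m in mels]
mean, std = dataset_stats(log_mels)
inputs = [normalize(lm, mean, std) for lm in log_mels]
```

Computing the statistics on the training split only, then reusing them at evaluation time, keeps the test data out of the preprocessing step.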

Framework versions

Transformers 4.27.4
TensorFlow 2.12.0
Tokenizers 0.13.3