ast-finetuned-audioset-10-10-0.4593_ft_ESC-50_aug_0-1

This model is a fine-tuned version of MIT/ast-finetuned-audioset-10-10-0.4593 on a subset of the ashraq/esc50 dataset. It achieves the following results on the evaluation set:

Training and evaluation data

Training and evaluation data were augmented with the audiomentations library (GitHub: iver56/audiomentations). The following augmentation methods were applied, based on previous experiments (Elliott et al.: Tiny transformers for audio classification at the edge):

Gain

each audio sample is amplified/attenuated by a random factor between 0.5 and 1.5 with a 0.3 probability

Noise

a random amount of Gaussian noise with a relative amplitude between 0.001 and 0.015 is added to each audio sample with a 0.5 probability

Speed adjust

the duration of each audio sample is stretched or compressed by a random rate factor between 0.5 and 1.5 with a 0.3 probability

Pitch shift

the pitch of each audio sample is shifted by a random number of semitones selected from the closed interval [-4, 4] with a 0.3 probability

Time masking

a random fraction of the length of each audio sample, in the range (0, 0.02], is erased with a 0.3 probability
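
The probability-gated behaviour described above can be illustrated with a minimal, standard-library-only sketch. This is not the audiomentations implementation and not the training code used for this model: the function names (`apply_gain`, `add_gaussian_noise`, `time_mask`, `augment`) are hypothetical, the noise amplitude is assumed to act as the standard deviation of the Gaussian noise, and speed adjustment and pitch shifting are left out because they need resampling.

```python
import random

def apply_gain(samples, rng, lo=0.5, hi=1.5):
    """Scale every sample by one random factor drawn from [lo, hi]."""
    factor = rng.uniform(lo, hi)
    return [s * factor for s in samples]

def add_gaussian_noise(samples, rng, lo=0.001, hi=0.015):
    """Add zero-mean Gaussian noise; its standard deviation is drawn from [lo, hi] (assumption)."""
    sigma = rng.uniform(lo, hi)
    return [s + rng.gauss(0.0, sigma) for s in samples]

def time_mask(samples, rng, max_fraction=0.02):
    """Zero out one contiguous span covering at most max_fraction of the clip."""
    n = len(samples)
    width = int(n * rng.uniform(0.0, max_fraction))
    start = rng.randint(0, n - width)
    return samples[:start] + [0.0] * width + samples[start + width:]

# (transform, probability) pairs, using the probabilities listed above.
PIPELINE = [(apply_gain, 0.3), (add_gaussian_noise, 0.5), (time_mask, 0.3)]

def augment(samples, rng):
    """Apply each transform independently with its own probability."""
    out = list(samples)
    for transform, p in PIPELINE:
        if rng.random() < p:
            out = transform(out, rng)
    return out

rng = random.Random(0)          # fixed seed so runs are repeatable
clip = [0.1] * 1000             # stand-in for a mono waveform
augmented = augment(clip, rng)
print(len(augmented))           # these three transforms keep the length: 1000
```

Because each transform is gated independently, a given clip may receive none, some, or all of the augmentations in a single pass.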

The following hyperparameters were used during training:

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy	Precision	Recall	F1
9.9002	1.0	28	8.5662	0.0	0.0	0.0	0.0
5.7235	2.0	56	4.3990	0.0357	0.0238	0.0357	0.0286
2.4076	3.0	84	2.2972	0.4643	0.7405	0.4643	0.4684
1.4448	4.0	112	1.3975	0.7143	0.7340	0.7143	0.6863
0.8373	5.0	140	1.0468	0.8571	0.8524	0.8571	0.8448
0.7239	6.0	168	0.8518	0.8929	0.9164	0.8929	0.8766
0.6504	7.0	196	0.7391	0.9286	0.9449	0.9286	0.9244
0.5350	8.0	224	0.6682	0.9286	0.9449	0.9286	0.9244
0.4237	9.0	252	0.6443	0.9286	0.9449	0.9286	0.9244
0.3709	10.0	280	0.6304	0.9286	0.9449	0.9286	0.9244
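
In the table above, the Accuracy and Recall columns are identical in every row. That is what support-weighted averaging of per-class recall produces, although the card does not state which averaging mode was used, so treating the columns as weighted averages is an assumption. The sketch below uses a hypothetical helper, `weighted_metrics`, and made-up labels to show the identity; it is not the evaluation code used for this model.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n         # class share of the evaluation set
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(correct.values()) / n
    return accuracy, precision, recall, f1

# Illustrative labels only, not taken from the ESC-50 evaluation subset.
y_true = ["dog", "dog", "rain", "rain", "clock"]
y_pred = ["dog", "rain", "rain", "rain", "dog"]
accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(abs(accuracy - recall) < 1e-9)  # True: weighted recall equals accuracy
```

Weighted precision and F1 have no such identity with accuracy, which is consistent with those columns differing from it in the table.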