hmByT5 - Preliminary Language Models

Preliminary Historic Multilingual and Monolingual ByT5 Models. The following languages are currently covered:

More details can be found in our GitHub repository.

Pretraining

We use the official JAX/FLAX example in Hugging Face Transformers to pretrain a ByT5 model on a single v3-8 TPU. Details about the training can be found here.

This model was trained with mean_noise_span_length=20.
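
As a rough illustration, such a pretraining run could be launched as shown below. This is a minimal sketch, assuming the upstream run_t5_mlm_flax.py script from the Transformers Flax language-modeling examples is available locally; the corpus file, sequence length, batch size and learning rate are placeholders, and only mean_noise_span_length reflects the value mentioned above.

```python
# Minimal sketch: launching the official Flax T5 MLM example with a ByT5
# config. All paths and hyperparameters are placeholders, except
# --mean_noise_span_length, which reflects the value mentioned above.
import subprocess

subprocess.run(
    [
        "python", "run_t5_mlm_flax.py",           # examples/flax/language-modeling
        "--config_name", "google/byt5-small",      # ByT5 architecture, trained from scratch
        "--tokenizer_name", "google/byt5-small",   # byte-level tokenizer, nothing to train
        "--train_file", "historic_corpus.txt",     # placeholder pretraining corpus
        "--max_seq_length", "1024",
        "--per_device_train_batch_size", "16",
        "--learning_rate", "1e-3",
        "--mean_noise_span_length", "20",
        "--output_dir", "./hmbyt5-preliminary",
    ],
    check=True,
)
```

Since ByT5 operates directly on UTF-8 bytes, a mean noise span of 20 bytes covers roughly a few words of Latin-script text, whereas the script's default of 3 masks only a few characters per span.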

Evaluation on Downstream Tasks (NER)

We evaluated the hmByT5 model on the ICDAR Europeana dataset:

| Configuration                          | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.         |
|----------------------------------------|-------|-------|-------|-------|-------|--------------|
| wsFalse-bs4-e10-lr0.00015-poolingfirst | 86.61 | 85.88 | 87.65 | 87.93 | 88.01 | 87.22 ± 0.83 |
| wsFalse-bs8-e10-lr0.00015-poolingfirst | 87.88 | 87.56 | 85.62 | 86.52 | 87.03 | 86.92 ± 0.80 |
| wsFalse-bs4-e10-lr0.00016-poolingfirst | 86.17 | 85.87 | 87.77 | 86.58 | 87.96 | 86.87 ± 0.85 |
| wsFalse-bs8-e10-lr0.00016-poolingfirst | 87.67 | 86.02 | 85.66 | 87.00 | 85.99 | 86.47 ± 0.75 |

The results show no performance improvement over the model trained with mean_noise_span_length=3, which achieved 87.90 ± 0.71.
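
The configuration names encode the fine-tuning hyperparameters: batch size (bs4/bs8), number of epochs (e10), learning rate (lr0.00015/lr0.00016) and subtoken pooling strategy (poolingfirst, i.e. first-subtoken pooling). As a rough illustration only (not the exact setup used), a Flair-based fine-tuning run with the hyperparameters of the first configuration could look like the sketch below; dataset paths, column layout and the checkpoint id are placeholders.

```python
# Rough sketch of a Flair NER fine-tuning run; dataset paths, column layout
# and the checkpoint id are placeholders. Hyperparameters follow the first
# configuration in the table above (bs4, e10, lr0.00015, first pooling).
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# ICDAR Europeana NER data in CoNLL-style column format (placeholder paths).
corpus = ColumnCorpus("data/icdar-europeana", {0: "text", 1: "ner"})
label_dictionary = corpus.make_label_dictionary(label_type="ner")

# ByT5 encoder used as word embeddings with first-subtoken pooling.
embeddings = TransformerWordEmbeddings(
    "path/to/hmbyt5-checkpoint",  # placeholder for the pretrained hmByT5 model
    subtoken_pooling="first",
    fine_tune=True,
)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dictionary,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/icdar-europeana",  # output directory
    learning_rate=0.00015,  # lr0.00015
    mini_batch_size=4,      # bs4
    max_epochs=10,          # e10
)
```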

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️