hmByT5 - Preliminary Language Models

Preliminary historic multilingual and monolingual ByT5 models. The following languages are currently covered:

More details can be found in our GitHub repository.

Pretraining

We pretrain hmByT5 on a v3-32 TPU Pod. Details about the training can be found here.

Evaluation on Downstream Tasks (NER)

We evaluated the hmByT5 model that was pretrained on the English AjMC corpus for 200k steps:

| Hyper-param Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---|---|---|---|---|---|---|
| wsFalse-bs4-e10-lr0.00016-poolingfirst | 83.80 | 84.78 | 83.74 | 83.35 | 84.37 | 84.01 ± 0.50 |
| wsFalse-bs4-e10-lr0.00015-poolingfirst | 84.67 | 82.69 | 83.92 | 84.53 | 82.90 | 83.74 ± 0.82 |
| wsFalse-bs8-e10-lr0.00016-poolingfirst | 82.12 | 83.82 | 83.37 | 83.00 | 83.70 | 83.20 ± 0.61 |
| wsFalse-bs8-e10-lr0.00015-poolingfirst | 83.45 | 82.83 | 84.15 | 81.76 | 83.78 | 83.19 ± 0.84 |
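
The Avg. column can be recomputed from the five runs per configuration. The sketch below is only an illustration: it assumes the ± value is the population standard deviation (dividing by the number of runs), which is inferred from the numbers and not stated anywhere in this README. The names `RUNS`, `summarize` and `SUMMARY` are hypothetical helpers, not part of any hmByT5 code.

```python
from statistics import mean, pstdev

# Scores copied from the table above, keyed by hyper-parameter configuration.
RUNS = {
    "wsFalse-bs4-e10-lr0.00016-poolingfirst": [83.80, 84.78, 83.74, 83.35, 84.37],
    "wsFalse-bs4-e10-lr0.00015-poolingfirst": [84.67, 82.69, 83.92, 84.53, 82.90],
    "wsFalse-bs8-e10-lr0.00016-poolingfirst": [82.12, 83.82, 83.37, 83.00, 83.70],
    "wsFalse-bs8-e10-lr0.00015-poolingfirst": [83.45, 82.83, 84.15, 81.76, 83.78],
}


def summarize(scores):
    """Format scores as 'mean ± std'.

    Assumption: the spread is the population standard deviation (pstdev),
    not the sample standard deviation.
    """
    return f"{mean(scores):.2f} ± {pstdev(scores):.2f}"


# One summary string per configuration, in table order.
SUMMARY = {config: summarize(scores) for config, scores in RUNS.items()}

if __name__ == "__main__":
    for config, summary in SUMMARY.items():
        print(f"{config}  {summary}")
```

Under this assumption the recomputed values agree with the Avg. column for all four configurations. Using the sample standard deviation (`statistics.stdev`) instead would give slightly larger spreads.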

The results are not on par with the current state of the art on the English AjMC corpus; see a comparison here. We therefore continue our experiments with the Hugging Face Transformers JAX/Flax implementation to pretrain ByT5 models on TPU.

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️