oyo-bert-base

OYO-BERT (Oyo-dialect Yoruba BERT) was created by pre-training a BERT model with token dropping on Yoruba-language texts for about 100K steps. It uses the BERT-base architecture and was trained with the TensorFlow Model Garden.
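
If you want to check locally that the released checkpoint follows the BERT-base configuration, a quick sketch (not part of the original card; it only assumes the standard Transformers AutoConfig API) is:

>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained('Davlan/oyo-bert-base')
>>> # expected to be (768, 12, 12) for a BERT-base configuration
>>> config.hidden_size, config.num_hidden_layers, config.num_attention_heads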

Pre-training corpus

A mix of WURA, Wikipedia and MT560 Yoruba data

How to use

You can use this model with the Transformers pipeline for masked-token prediction.

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='Davlan/oyo-bert-base')
>>> unmasker("Ọjọ kẹsan-an, [MASK] Kẹjọ ni wọn ri oku Baba")
[{'score': 0.9981744289398193, 'token': 3785, 'token_str': 'osu', 'sequence': 'ojo kesan - an, osu kejo ni won ri oku baba'},
 {'score': 0.0015279919607564807, 'token': 3355, 'token_str': 'ojo', 'sequence': 'ojo kesan - an, ojo kejo ni won ri oku baba'},
 {'score': 0.0001734074903652072, 'token': 11780, 'token_str': 'osun', 'sequence': 'ojo kesan - an, osun kejo ni won ri oku baba'},
 {'score': 9.066923666978255e-05, 'token': 21579, 'token_str': 'oṣu', 'sequence': 'ojo kesan - an, oṣu kejo ni won ri oku baba'},
 {'score': 1.816015355871059e-05, 'token': 3387, 'token_str': 'odun', 'sequence': 'ojo kesan - an, odun kejo ni won ri oku baba'}]
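
If you need more control than the pipeline offers (for example, to inspect the raw logits), the same prediction can be reproduced with the generic Transformers auto classes. This is a minimal sketch, not part of the original card:

>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained('Davlan/oyo-bert-base')
>>> model = AutoModelForMaskedLM.from_pretrained('Davlan/oyo-bert-base')
>>> text = "Ọjọ kẹsan-an, [MASK] Kẹjọ ni wọn ri oku Baba"
>>> inputs = tokenizer(text, return_tensors='pt')
>>> with torch.no_grad():
...     logits = model(**inputs).logits
...
>>> # locate the [MASK] position and take the top-5 token ids at that position
>>> mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
>>> top_ids = logits[0, mask_index].topk(5).indices[0]
>>> tokenizer.convert_ids_to_tokens(top_ids.tolist())  # e.g. 'osu' first, as in the pipeline output above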

Acknowledgment

We thank @stefan-it for providing the pre-processing and pre-training scripts. We would also like to thank Google Cloud for giving us access to a TPU v3-8 through free cloud credits. The model was trained with Flax before being converted to PyTorch.

BibTeX entry and citation info