Masked Langauge Model

Serengeti

<p align="center"> <br> <img src="./serengeti_logo.png"/> <br> <p>

</p>

<img src="./serengati_languages.jpg" width="50%" height="50%" align="right"> <div style='text-align: justify;'> Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. <br><br> To date, only ~31 out of 2,000 African languages are covered in existing language models. We ameliorate this limitation by developing <b>SERENGETI</b>, a set of massively multilingual language model that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. <br><br> <b>SERENGETI</b> outperforms other models on 11 datasets across eights tasks, achieving 82.27 average F<sub>1</sub>-score. We also perform analyses of errors from our models, which allows us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research. </div>

3. How to use Serengeti model

Below is an example for using Serengeti predict masked tokens.

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/serengeti", use_auth_token="XXX")

model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/serengeti", use_auth_token="XXX")
from transformers import pipeline

classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
classifier("ẹ jọwọ , ẹ <mask> mi") #Yoruba
[{'score': 0.07887924462556839,
  'token': 8418,
  'token_str': 'ọmọ',
  'sequence': 'ẹ jọwọ, ẹ ọmọ mi'},
 {'score': 0.04658124968409538,
  'token': 156595,
  'token_str': 'fẹ́ràn',
  'sequence': 'ẹ jọwọ, ẹ fẹ́ràn mi'},
 {'score': 0.029315846040844917,
  'token': 204050,
  'token_str': 'gbàgbé',
  'sequence': 'ẹ jọwọ, ẹ gbàgbé mi'},
 {'score': 0.02790883742272854,
  'token': 10730,
  'token_str': 'kọ',
  'sequence': 'ẹ jọwọ, ẹ kọ mi'},
 {'score': 0.022904086858034134,
  'token': 115382,
  'token_str': 'bẹ̀rù',
  'sequence': 'ẹ jọwọ, ẹ bẹ̀rù mi'}]

For the more details please read this notebook Open In Colab

4. Ethics

Serengeti aligns with Afrocentric NLP where the needs of African people is put into consideration when developing technology. We believe Serengeti will not only be useful to speakers of the languages supported, but also researchers of African languages such as anthropologists and linguists. We discuss below some use cases for Serengeti and offer a number of broad impacts.

Supported languages

Please refer to suported-languages

Citation

If you use the pre-trained model (Serengeti) for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):

@inproceedings{adebara-etal-2023-serengeti,
    title = "{SERENGETI}: Massively Multilingual Language Models for {A}frica",
    author = "Adebara, Ife  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad  and
      Alcoba Inciarte, Alcides",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.97",
    doi = "10.18653/v1/2023.findings-acl.97",
    pages = "1498--1537",
}

Acknowledgments

We gratefully acknowledges support from Canada Research Chairs (CRC), the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN-2018-04267), the Social Sciences and Humanities Research Council of Canada (SSHRC; 435-2018-0576; 895-2020-1004; 895-2021-1008), Canadian Foundation for Innovation (CFI; 37771), Digital Research Alliance of Canada, UBC ARC-Sockeye, Advanced Micro Devices, Inc. (AMD), and Google. Any opinions, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of CRC, NSERC, SSHRC, CFI, the Alliance, AMD, Google, or UBC ARC-Sockeye.