SwissBERT is a masked language model for processing Switzerland-related text. It has been trained on more than 21 million Swiss news articles retrieved from Swissdox@LiRI.

<img src="https://vamvas.ch/assets/swissbert/swissbert-diagram.png" alt="SwissBERT is a transformer encoder with language adapters in each layer. There is an adapter for each national language of Switzerland. The other parameters in the model are shared among the four languages." width="450" style="max-width: 100%;">

SwissBERT is based on X-MOD, which has been pre-trained with language adapters in 81 languages. For SwissBERT, we trained adapters for the national languages of Switzerland – German, French, Italian, and Romansh Grischun. In addition, we use a Switzerland-specific subword vocabulary.

The pre-training code and usage examples are available here. We also release a version that was fine-tuned for named entity recognition (NER): https://huggingface.co/ZurichNLP/swissbert-ner
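The NER model can be used with the standard `transformers` token-classification pipeline. The following is a minimal sketch, not taken from the model card: the `aggregation_strategy` setting and the example sentence are our own choices, and we assume the fine-tuned model exposes the same `set_default_language` method as the base model.

```python
from transformers import pipeline

# Token-classification pipeline for the NER fine-tuned SwissBERT.
# aggregation_strategy="simple" merges subword tokens into whole entity spans.
ner = pipeline(
    "token-classification",
    model="ZurichNLP/swissbert-ner",
    aggregation_strategy="simple",
)

# Activate the Swiss Standard German adapter before tagging German text.
ner.model.set_default_language("de_CH")
print(ner("Roger Federer wuchs in Basel auf."))
```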

## Languages

SwissBERT contains the following language adapters:

| lang_id (adapter index) | Language code | Language |
|---|---|---|
| 0 | `de_CH` | Swiss Standard German |
| 1 | `fr_CH` | French |
| 2 | `it_CH` | Italian |
| 3 | `rm_CH` | Romansh Grischun |
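The language codes above are the values accepted by `set_default_language`. As a quick sanity check, the adapters registered in the model configuration can be listed; this sketch assumes the `languages` attribute of the X-MOD configuration class in `transformers`.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("ZurichNLP/swissbert")

# The adapters available in this checkpoint (order corresponds to lang_id).
print(model.config.languages)

# Activate the Romansh adapter for subsequent forward passes.
model.set_default_language("rm_CH")
```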

## License

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

## Usage (masked language modeling)

```python
from transformers import pipeline

fill_mask = pipeline(model="ZurichNLP/swissbert")
```

### German example

```python
fill_mask.model.set_default_language("de_CH")
fill_mask("Der schönste Kanton der Schweiz ist <mask>.")
```

Output:

```
[{'score': 0.1373230218887329,
  'token': 331,
  'token_str': 'Zürich',
  'sequence': 'Der schönste Kanton der Schweiz ist Zürich.'},
 {'score': 0.08464793860912323,
  'token': 5903,
  'token_str': 'Appenzell',
  'sequence': 'Der schönste Kanton der Schweiz ist Appenzell.'},
 {'score': 0.08250337839126587,
  'token': 10800,
  'token_str': 'Graubünden',
  'sequence': 'Der schönste Kanton der Schweiz ist Graubünden.'},
 ...]
```

### French example

```python
fill_mask.model.set_default_language("fr_CH")
fill_mask("Je m'appelle <mask> Federer.")
```

Output:

```
[{'score': 0.9943694472312927,
  'token': 1371,
  'token_str': 'Roger',
  'sequence': "Je m'appelle Roger Federer."},
 ...]
```
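Beyond fill-mask, the model can also be used as an encoder to extract sentence representations. The recipe below is a common sketch and not an official part of this model card: mean pooling over the last hidden layer is our own choice, and the Italian example sentence is ours.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModel.from_pretrained("ZurichNLP/swissbert")

# Activate the Italian adapter before encoding Italian text.
model.set_default_language("it_CH")

inputs = tokenizer("La Svizzera ha quattro lingue nazionali.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden layer over tokens: one vector per input sentence,
# with shape (batch_size, hidden_size).
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```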

## Bias, Risks, and Limitations

## Training Details

## Environmental Impact

## Citation

```bibtex
@article{vamvas-etal-2023-swissbert,
      title={Swiss{BERT}: The Multilingual Language Model for Switzerland},
      author={Jannis Vamvas and Johannes Gra{\"e}n and Rico Sennrich},
      year={2023},
      eprint={2303.13310},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2303.13310}
}
```