roberta icelandic norwegian faroese danish swedish masked-lm pytorch

ScandiBERT-no-faroese

This is a version of the ScandiBERT model trained without any Faroese data and a different subword tokenizer.

The model was trained on the data shown in the table below. Batch size was 8.8k, the model was trained for 72 epochs on 24 V100 cards for about 2 weeks.

Language Data Size
Icelandic See IceBERT paper 16 GB
Danish Danish Gigaword Corpus (incl Twitter) 4,7 GB
Norwegian NCC corpus 42 GB
Swedish Swedish Gigaword Corpus 3,4 GB

If you find this model useful, please cite

@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn  and
      Simonsen, Annika  and
      Glavaš, Goran  and
      Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
}