RoBERTa for Multilabel Language Segmentation

Training

RoBERTa was fine-tuned on small subsets of the Open Subtitles, OSCAR and Tatoeba datasets (~9k samples per language).

The multilingual training data was created with a heuristic algorithm that also generates the target masks: https://github.com/n1kstep/lang-classifier
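As a rough illustration of what such data creation can look like, the sketch below concatenates monolingual sentences from different languages and records a per-word language label as the target mask. This is not the algorithm from the linked repository; the function name `make_mixed_sample`, the `sentences_by_lang` structure and the whitespace-based word splitting are assumptions made for the example.

```python
import random


def make_mixed_sample(sentences_by_lang, n_segments=2, rng=None):
    """Build one multilingual sample and its per-word language mask.

    Hypothetical helper: sentences_by_lang maps a language code to a list
    of monolingual sentences. Words are split on whitespace (an assumption;
    a real pipeline would likely align the mask to model tokens instead).
    """
    rng = rng or random.Random()
    # Pick distinct languages, one per segment of the mixed sample.
    langs = rng.sample(sorted(sentences_by_lang), k=n_segments)
    words, mask = [], []
    for lang in langs:
        sentence = rng.choice(sentences_by_lang[lang])
        tokens = sentence.split()
        words.extend(tokens)
        # Every word in this segment is labeled with its source language.
        mask.extend([lang] * len(tokens))
    return " ".join(words), mask


corpus = {
    "en": ["the weather is nice today"],
    "de": ["ich habe keine Zeit"],
    "fr": ["je ne sais pas"],
}
text, mask = make_mixed_sample(corpus, n_segments=2, rng=random.Random(0))
```

Here `mask` has one language label per whitespace-separated word in `text`, which is the kind of target a segmentation model can be trained against.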

The metrics below were obtained by validating on a separate part of the dataset (~1k samples per language).

Validation Loss	Precision	Recall	F1-Score	Accuracy
0.029172	0.919623	0.933586	0.926552	0.991883
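As a quick consistency check on the table, the snippet below recomputes F1 as the harmonic mean of the reported precision and recall. It assumes the reported F1-Score was derived from those same two aggregate values; the averaging scheme is not stated here, so treat this as a sanity check only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the validation table above.
precision, recall = 0.919623, 0.933586
f1 = f1_score(precision, recall)
print(f"{f1:.6f}")
```

The result agrees with the reported F1-Score of 0.926552 to within rounding.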