# RoBERTa for Single Language Classification
## Training

RoBERTa was fine-tuned on small subsets of the OpenSubtitles, OSCAR, and Tatoeba datasets (~9k samples per language), split by source as shown below; a fine-tuning sketch follows the table.
| data source | languages |
|---|---|
| open_subtitles | ka, he, en, de |
| oscar | be, kk, az, hy |
| tatoeba | ru, uk |
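
For reference, here is a minimal sketch of a comparable fine-tuning setup with the Hugging Face `Trainer`. The base checkpoint (`xlm-roberta-base`), the `train.csv`/`valid.csv` files, and the hyperparameters are illustrative assumptions; the card does not specify them.

```python
# Fine-tuning sketch. Assumed: an XLM-RoBERTa base checkpoint and CSV files
# with "text" and "label" columns, where "label" is a language code.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

LANGS = ["az", "be", "de", "en", "he", "hy", "ka", "kk", "ru", "uk"]
label2id = {lang: i for i, lang in enumerate(LANGS)}
id2label = {i: lang for lang, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LANGS), label2id=label2id, id2label=id2label
)

data = load_dataset("csv", data_files={"train": "train.csv", "validation": "valid.csv"})

def preprocess(batch):
    # Tokenize the text and map language codes ("az", "be", ...) to class ids.
    enc = tokenizer(batch["text"], truncation=True, max_length=128)
    enc["label"] = [label2id[lang] for lang in batch["label"]]
    return enc

data = data.map(preprocess, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lang-clf", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
trainer.save_model("lang-clf")
```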
## Validation

Metrics were computed on a held-out portion of the same datasets (~1k samples per language); a sketch for reproducing this report follows the table.
| class | f1-score | precision | recall | support |
|---|---|---|---|---|
| az | 0.998 | 0.997 | 1.0 | 997 |
| be | 0.996 | 0.998 | 0.994 | 1004 |
| de | 0.976 | 0.966 | 0.987 | 979 |
| en | 0.976 | 0.986 | 0.967 | 1020 |
| he | 1.0 | 1.0 | 0.999 | 1001 |
| hy | 0.994 | 0.991 | 0.998 | 993 |
| ka | 0.999 | 0.999 | 0.999 | 1000 |
| kk | 0.996 | 0.998 | 0.993 | 1005 |
| uk | 0.982 | 0.997 | 0.968 | 1030 |
| ru | 0.982 | 0.968 | 0.997 | 971 |
| macro avg | 0.99 | 0.99 | 0.99 | 10000 |
| weighted avg | 0.99 | 0.99 | 0.99 | 10000 |
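
The report follows scikit-learn's `classification_report` format; the sketch below shows one way to reproduce it. The `lang-clf` checkpoint path and `valid.csv` file are hypothetical stand-ins, not artifacts published with this model.

```python
# Validation sketch. Assumed: a fine-tuned checkpoint saved at "lang-clf"
# and a held-out valid.csv with "text" and "label" (language code) columns.
import pandas as pd
import torch
from sklearn.metrics import classification_report
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lang-clf")
model = AutoModelForSequenceClassification.from_pretrained("lang-clf")
model.eval()

df = pd.read_csv("valid.csv")
preds = []
with torch.no_grad():
    for start in range(0, len(df), 64):  # batched inference
        batch = tokenizer(df["text"].iloc[start:start + 64].tolist(),
                          truncation=True, max_length=128,
                          padding=True, return_tensors="pt")
        ids = model(**batch).logits.argmax(dim=-1).tolist()
        preds.extend(model.config.id2label[i] for i in ids)

# digits=3 matches the precision of the table above.
print(classification_report(df["label"], preds, digits=3))
```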