Model Description
This model is a fine-tuned version of the DistilBERT base multilingual model, adapted for token classification where the tokens are ASCII characters and the labels are Vietnamese characters. The code for building and training this model can be found here.
The model is trained on Vietnamese Wikipedia data, available here.
We encourage potential users of this model to check out the BERT base multilingual model card to learn more about usage, limitations, and potential biases.
- Developed by: Daniel Saelid, Sachin Kumar, Yulia Tsvetkov
- Model type: Transformer-based language model
- Related Models: DistilBERT base multilingual model, BERT base multilingual model
- Resources for more information:
Direct Use
You can use the raw model to restore diacritics for ASCII-ified Vietnamese text.
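The model expects ASCII-ified input, i.e. Vietnamese text with its diacritics stripped. As a minimal sketch of what that preprocessing might look like (the card does not specify the exact procedure; this version assumes Unicode NFD decomposition plus an explicit mapping for đ/Đ, which do not decompose):

```python
import unicodedata

def asciify(text: str) -> str:
    """Strip Vietnamese diacritics, mapping each character to plain ASCII."""
    # đ/Đ carry no combining mark under NFD, so map them explicitly.
    text = text.replace("\u0111", "d").replace("\u0110", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (Unicode category Mn), keeping the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(asciify("Tiếng Việt"))  # -> Tieng Viet
```

The model's task is the inverse of this function: predicting, for each ASCII character, the original Vietnamese character.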
Evaluation
The model developers report the following accuracies for restoring diacritics on ASCII-ified Vietnamese text. All metrics only consider syllables that contain just alphabetic characters.
| Character Accuracy (%) | Syllable Accuracy (%) | Sentence Accuracy (%) |
| --- | --- | --- |
| 98.75 | 96.10 | 50.26 |
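A hedged sketch of how such metrics could be computed, assuming whitespace-separated syllables and the stated restriction to all-alphabetic syllables (the card does not publish the evaluation script, so the helper names and tie-breaking choices here are illustrative):

```python
def is_considered(syllable: str) -> bool:
    # Per the card, only syllables made up entirely of alphabetic characters count.
    return syllable.isalpha()

def accuracies(pred_sents, gold_sents):
    """Character-, syllable-, and sentence-level accuracy over considered syllables."""
    char_ok = char_total = syl_ok = syl_total = sent_ok = 0
    for pred, gold in zip(pred_sents, gold_sents):
        sent_correct = True
        for p, g in zip(pred.split(), gold.split()):
            if not is_considered(g):
                continue
            syl_total += 1
            if p == g:
                syl_ok += 1
            else:
                sent_correct = False
            # zip truncates if lengths differ; adequate for this sketch since
            # diacritic restoration preserves character count.
            for pc, gc in zip(p, g):
                char_total += 1
                char_ok += pc == gc
        sent_ok += sent_correct
    return char_ok / char_total, syl_ok / syl_total, sent_ok / len(gold_sents)
```

For example, `accuracies(["viet nam"], ["việt nam"])` yields a syllable accuracy of 0.5 (one of two syllables restored exactly) and a sentence accuracy of 0.0.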