## Model Description

This model is a fine-tuned version of the DistilBERT base multilingual model, adapted for token classification where the input tokens are ASCII characters and the labels are the corresponding Vietnamese characters. The code for building and training this model can be found here.
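Concretely, each ASCII input character is classified into the Vietnamese character it should be restored to. A minimal sketch of this framing (the pairing below is illustrative, not the model's actual preprocessing code):

```python
def char_labels(ascii_text: str, target_text: str) -> list[tuple[str, str]]:
    # Pair each ASCII input character with the Vietnamese character
    # it should become; assumes the two strings align one-to-one.
    assert len(ascii_text) == len(target_text)
    return list(zip(ascii_text, target_text))

print(char_labels("Viet", "Việt"))  # [('V', 'V'), ('i', 'i'), ('e', 'ệ'), ('t', 't')]
```

Each pair is one training example for the classifier head: the ASCII character on the left, the target Vietnamese character on the right.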

The model is trained on the Vietnamese Wikipedia data here.

We encourage potential users of this model to check out the BERT base multilingual model card to learn more about usage, limitations, and potential biases.

## Direct Use

You can use the raw model to restore diacritics for ASCII-ified Vietnamese text.
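The model expects ASCII-ified input, i.e. Vietnamese text with its diacritics stripped. One common way to produce such input from accented text is Unicode decomposition; the helper name `asciify` below is illustrative, not part of the model's API:

```python
import unicodedata

def asciify(text: str) -> str:
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # đ/Đ carry no combining mark and do not decompose, so map them explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")

print(asciify("Việt Nam"))  # Viet Nam
```

Feeding the `asciify` output to the model and reading off the predicted label for each character yields the diacritic-restored text.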

## Evaluation

The model developers report the following accuracies for restoring diacritics on ASCII-ified Vietnamese text. All metrics only consider syllables that contain just alphabetic characters.
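Under one plausible reading of these metrics — treating whitespace-separated tokens as syllables, which is an assumption and not the developers' published evaluation code — they could be computed as:

```python
def accuracies(predicted: str, reference: str):
    # Keep only syllables (whitespace-separated tokens) made up purely of
    # alphabetic characters, as the reported metrics do.
    pairs = [(p, r) for p, r in zip(predicted.split(), reference.split())
             if r.isalpha()]
    char_total = sum(len(r) for _, r in pairs)
    char_hits = sum(pc == rc for p, r in pairs for pc, rc in zip(p, r))
    syll_hits = sum(p == r for p, r in pairs)
    sent_hit = all(p == r for p, r in pairs)
    return char_hits / char_total, syll_hits / len(pairs), sent_hit

print(accuracies("tieng Việt", "tiếng Việt"))
```

Here a sentence counts as correct only if every considered syllable matches, which is why sentence accuracy is far lower than character or syllable accuracy.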

| Character Accuracy (%) | Syllable Accuracy (%) | Sentence Accuracy (%) |
| :---: | :---: | :---: |
| 98.75 | 96.10 | 50.26 |