Vi-XLM-RoBERTa base model (uncased)
Epoch 0/40. Running loss: 6.4104. Step 25149/2442585 (1%) [7:18:31 elapsed < 639:12:16 remaining]
MODEL IS NOT BEING TRAINED (training is on hold for a while)
- Progress: ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 1%
- Started: 2022-09-26
- Last updated: 2022-09-28
- Current checkpoint: checkpoint-x
- Status: on hold
<a href=""> <img width="1024px" srcng"> </a>
Logging:
- <a href="https://wandb.ai/anhdungitvn/test/reports/global_step-22-10-06-08-59-07---VmlldzoyNzQ3NDc5?accessToken=seza8yu7owqp1qaoestn9fozlx4v5zzutg0y5qhj2ofwxmjb0r9wq5ko1iias27o">Global Step</a>
- <a href="https://wandb.ai/anhdungitvn/test/reports/lr-22-10-06-08-58-48---VmlldzoyNzQ3NDc3?accessToken=nitqvpya34qd2eg50kt65w8jr4e2qexld2u396mj13h6uutal9clbfif0hey4gsq">Learning Rate</a>
- <a href="https://wandb.ai/anhdungitvn/test/reports/Training-loss-22-10-06-08-57-37---VmlldzoyNzQ3NDcz?accessToken=okepl0ziwxa92r5374qd1ukgotq566r876ac7o0w6acld6ncs6o1rm08dppw31bn">Training Loss</a>
- <a href="https://wandb.ai/anhdungitvn/test/reports/eval_loss-22-10-06-08-59-27---VmlldzoyNzQ3NDgx?accessToken=aqewlpxjm1h6np55ld3bmph062ixt2fkkdlv46s3yk20p63mh0ibkf4iuvl043je">Eval Loss</a>
- <a href="https://wandb.ai/anhdungitvn/test/reports/perplexity-22-10-06-08-59-39---VmlldzoyNzQ3NDgy?accessToken=dowvrx240yi4w3b9qr8lq6rtzrnbl5h3ymflpxirg8cepf5cf1e10pcvvy7lcvda">Perplexity</a>
Model description
This is a Vietnamese XLM-RoBERTa base model, uncased.
Intended uses & limitations
You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.
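A quick way to try the raw masked-language-modeling capability is the fill-mask pipeline. A minimal sketch, assuming the checkpoint is published under the hypothetical repo name used in this card (the model is uncased, so lowercased input is the safe choice):

```python
from transformers import pipeline

# Hypothetical repo name for this card's checkpoint.
unmasker = pipeline("fill-mask", model="anhdungitvn/vi-xlm-roberta-base")

# XLM-RoBERTa tokenizers use <mask> as the mask token.
# "hà nội là <mask> của việt nam." = "Hanoi is the <mask> of Vietnam."
print(unmasker("hà nội là <mask> của việt nam."))
```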
How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AutoTokenizer, XLMRobertaForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('anhdungitvn/vi-xlm-roberta-base')
model = XLMRobertaForMaskedLM.from_pretrained('anhdungitvn/vi-xlm-roberta-base')

text = "Câu bằng tiếng Việt."  # "A sentence in Vietnamese."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # masked-LM logits for each token
```
and in TensorFlow:
```python
from transformers import AutoTokenizer, TFXLMRobertaForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('anhdungitvn/vi-xlm-roberta-base')
model = TFXLMRobertaForMaskedLM.from_pretrained('anhdungitvn/vi-xlm-roberta-base')

text = "Câu bằng tiếng Việt."  # "A sentence in Vietnamese."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)  # masked-LM logits for each token
```
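The snippets above load the masked-LM head, so `output` contains token logits rather than sentence features. To get hidden-state features as the section title suggests, the bare encoder can be loaded with `AutoModel`; a minimal sketch under the same hypothetical repo name:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('anhdungitvn/vi-xlm-roberta-base')
model = AutoModel.from_pretrained('anhdungitvn/vi-xlm-roberta-base')

encoded_input = tokenizer("Câu bằng tiếng Việt.", return_tensors='pt')
features = model(**encoded_input).last_hidden_state  # shape: (batch, seq_len, hidden_size)
```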
Limitations and bias
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions.
Training data
Vietnamese Wikipedia (2022, 1 GB) + Vietnamese news (reference to be added)
Training procedure
The model was pretrained with the following setup (a tokenizer-training sketch follows this list):
- Tokenizer: SentencePiece (BPE) with a vocabulary of 256,000 tokens.
- Model type: xlmroberta
- Optimizer: AdamW
- Learning rate: 5e-4
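A sketch of how a SentencePiece-style BPE tokenizer with a 256,000-token vocabulary could be built with the `tokenizers` library. The corpus file name is a placeholder, and this illustrates the stated configuration rather than the exact recipe used here:

```python
from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["vi_corpus.txt"],  # placeholder; the card lists Vietnamese Wikipedia + news
    vocab_size=256000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save("vi-sentencepiece-bpe.json")
```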
Evaluation results
Pretraining metrics and results:
When fine-tuned on downstream tasks, this model achieves the following results:
Task | SC | CN | NC | VSLP_2016_ASC | T | T | T | T
--- | --- | --- | --- | --- | --- | --- | --- | ---
x | x | x | x | x | x | x | x | x
Downstream task datasets (a generic fine-tuning sketch follows this list):
- <a href="https://www.aivivn.com/contests/6">SC: Sentiment Classification (Phân loại sắc thái bình luận)</a>
- <a href="https://huggingface.co/datasets/truongpdd/Covid-19-ner-lowercased">CN: Covid-19 NER</a>
- <a href="https://huggingface.co/datasets/truongpdd/new_categorical_dataset">NC: News Classification</a>
- <a href="https://huggingface.co/datasets/truongpdd/VSLP_2016_ASC">VSLP_2016_ASC</a>
Per-task evaluation results:
SC: Sentiment Classification (Phân loại sắc thái bình luận)
<a href=""> </a>
Metrics:
BibTeX entry and citation info
```bibtex
@article{2022,
  title={x},
  author={x},
  journal={ArXiv},
  year={2022},
  volume={x}
}
```
<a href="https://huggingface.co/exbert/?model=anhdungitvn/vi-xlm-roberta-base"> <img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png"> </a>