Vi-XLM-RoBERTa base model (uncased)
Epoch 0/40. Running loss: 6.4104. Step 25149/2442585 (1%) [7:18:31 elapsed < 639:12:16 remaining]
MODEL IS NOT BEING TRAINED (training is on hold for a while)
- Progress: ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 1%
- Started: 2022-09-26
- Last updated: 2022-09-28
- Current checkpoint: checkpoint-x
- Status: on hold
<a href=""> <img width="1024px" srcng"> </a>
Logging:
- <a href="https://wandb.ai/anhdungitvn/test/reports/global_step-22-10-06-08-59-07---VmlldzoyNzQ3NDc5?accessToken=seza8yu7owqp1qaoestn9fozlx4v5zzutg0y5qhj2ofwxmjb0r9wq5ko1iias27o">Global Step</a>
- <a href="https://wandb.ai/anhdungitvn/test/reports/lr-22-10-06-08-58-48---VmlldzoyNzQ3NDc3?accessToken=nitqvpya34qd2eg50kt65w8jr4e2qexld2u396mj13h6uutal9clbfif0hey4gsq">Learning Rate</a>
- <a href="https://wandb.ai/anhdungitvn/test/reports/Training-loss-22-10-06-08-57-37---VmlldzoyNzQ3NDcz?accessToken=okepl0ziwxa92r5374qd1ukgotq566r876ac7o0w6acld6ncs6o1rm08dppw31bn">Training Loss</a>
- <a href="https://wandb.ai/anhdungitvn/test/reports/eval_loss-22-10-06-08-59-27---VmlldzoyNzQ3NDgx?accessToken=aqewlpxjm1h6np55ld3bmph062ixt2fkkdlv46s3yk20p63mh0ibkf4iuvl043je">Eval Loss</a>
- <a href="https://wandb.ai/anhdungitvn/test/reports/perplexity-22-10-06-08-59-39---VmlldzoyNzQ3NDgy?accessToken=dowvrx240yi4w3b9qr8lq6rtzrnbl5h3ymflpxirg8cepf5cf1e10pcvvy7lcvda">Perplexity</a>
Model description
This is a Vietnamese XLM-RoBERTa base model, uncased.
Intended uses & limitations
You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.
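A quick way to try the raw masked-language-modeling capability is the fill-mask pipeline. A minimal sketch, assuming the checkpoint is published under the hypothetical repo name used in this card (the model is uncased, so lowercased input is the safe choice):

```python
from transformers import pipeline

# Hypothetical repo name for this card's checkpoint.
unmasker = pipeline("fill-mask", model="anhdungitvn/vi-xlm-roberta-base")

# XLM-RoBERTa tokenizers use <mask> as the mask token.
# "hà nội là <mask> của việt nam." = "Hanoi is the <mask> of Vietnam."
print(unmasker("hà nội là <mask> của việt nam."))
```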
How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AutoTokenizer, XLMRobertaForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('anhdungitvn/vi-xlm-roberta-base')
model = XLMRobertaForMaskedLM.from_pretrained('anhdungitvn/vi-xlm-roberta-base')

text = "Câu bằng tiếng Việt."  # "A sentence in Vietnamese."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # masked-LM logits for each token
```
and in TensorFlow:
```python
from transformers import AutoTokenizer, TFXLMRobertaForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('anhdungitvn/vi-xlm-roberta-base')
model = TFXLMRobertaForMaskedLM.from_pretrained('anhdungitvn/vi-xlm-roberta-base')

text = "Câu bằng tiếng Việt."  # "A sentence in Vietnamese."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)  # masked-LM logits for each token
```
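The snippets above load the masked-LM head, so `output` contains token logits rather than sentence features. To get hidden-state features as the section title suggests, the bare encoder can be loaded with `AutoModel`; a minimal sketch under the same hypothetical repo name:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('anhdungitvn/vi-xlm-roberta-base')
model = AutoModel.from_pretrained('anhdungitvn/vi-xlm-roberta-base')

encoded_input = tokenizer("Câu bằng tiếng Việt.", return_tensors='pt')
features = model(**encoded_input).last_hidden_state  # shape: (batch, seq_len, hidden_size)
```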
Limitations and bias
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions.
Training data
Vietnamese Wikipedia (2022, 1 GB) + Vietnamese news (reference to be added)
Training procedure
The model was pretrained with the following setup (a tokenizer-training sketch follows this list):
- Tokenizer: SentencePiece (BPE) with a vocabulary of 256,000 tokens.
- Model type: xlmroberta
- Optimizer: AdamW
- Learning rate: 5e-4
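A sketch of how a SentencePiece-style BPE tokenizer with a 256,000-token vocabulary could be built with the `tokenizers` library. The corpus file name is a placeholder, and this illustrates the stated configuration rather than the exact recipe used here:

```python
from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["vi_corpus.txt"],  # placeholder; the card lists Vietnamese Wikipedia + news
    vocab_size=256000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save("vi-sentencepiece-bpe.json")
```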
Evaluation results
Pretraining metrics and results:
When fine-tuned on downstream tasks, this model achieves the following results:
Task | SC | CN | NC | VSLP_2016_ASC | T | T | T | T
--- | --- | --- | --- | --- | --- | --- | --- | ---
x | x | x | x | x | x | x | x | x
Downstream task datasets (a generic fine-tuning sketch follows this list):
- <a href="https://www.aivivn.com/contests/6">SC: Sentiment Classification (Phân loại sắc thái bình luận)</a>
- <a href="https://huggingface.co/datasets/truongpdd/Covid-19-ner-lowercased">CN: Covid-19 NER</a>
- <a href="https://huggingface.co/datasets/truongpdd/new_categorical_dataset">NC: News Classification</a>
- <a href="https://huggingface.co/datasets/truongpdd/VSLP_2016_ASC">VSLP_2016_ASC</a>
Per-task evaluation results:
SC: Sentiment Classification (Phân loại sắc thái bình luận)
<a href=""> </a>
Metrics:
BibTeX entry and citation info
```bibtex
@article{2022,
  title={x},
  author={x},
  journal={ArXiv},
  year={2022},
  volume={x}
}
```
<a href="https://huggingface.co/exbert/?model=anhdungitvn/vi-xlm-roberta-base"> <img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png"> </a>