Model description

This model is vit5-base fine-tuned on a dataset of 300K Vietnamese news articles from vnexpress.net, dantri.com.vn, laodong.vn, and YouTube (link: https://drive.google.com/drive/folders/1RvywNl41QYNa2lthp-O8hakVCMsfX456) for tagging articles using their textual content as input. For Vietnamese, there are several state-of-the-art models, such as ViELECTRA [1], PhoBERT [2], ViT5 [3], and ViDeBERTa [4]. These models are applied to Part-of-Speech (POS) tagging, dependency parsing, Named Entity Recognition (NER), and summarization. To the best of our knowledge, however, no existing Vietnamese model addresses the tagging problem.

For the tagging problem, given ViT5's suitability for Vietnamese text generation, we fine-tuned both vit5-base and vit5-large and compared their efficiency in terms of quality and running time. For now we publish only the model based on vit5-base, since fine-tuning vit5-large takes considerably more time; we will publish that model in the future.

Dataset

The dataset is composed of Vietnamese news articles and their available tags (the tags were assigned by humans). We crawled 300K Vietnamese news articles from vnexpress.net, dantri.com.vn, and laodong.vn, and divided them into two parts: 250K for training and 50K for testing. From each article we extract two fields, the title and the tags, which form the input of the training phase. All data is preprocessed (removing special characters, collapsing duplicated spaces, etc.), as sketched below.
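The card only states that special characters and duplicated spaces are removed, so the following is a minimal sketch of that cleaning step; the exact character filter and the function name preprocess are assumptions:

import re

def preprocess(text: str) -> str:
    # Assumed filter: keep word characters (incl. Vietnamese letters), digits,
    # whitespace, and basic punctuation; replace everything else with a space
    text = re.sub(r"[^\w\s,.!?%-]", " ", text, flags=re.UNICODE)
    # Collapse duplicated whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Giá vàng   hôm nay:  tăng mạnh ###"))  # Giá vàng hôm nay tăng mạnh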

Evaluation results

We evaluate the fine-tuned vit5-base on the 50K test set using ROUGE metrics:

rouge1: 0.4159717966204846
rouge2: 0.25983482833746485
rougeL: 0.3770318612006469
rougeLsum: 0.37699834479994276
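ROUGE scores in this format can be computed with the Hugging Face evaluate library; below is a minimal sketch with made-up prediction/reference strings, not the actual test data:

import evaluate

rouge = evaluate.load("rouge")
# Hypothetical generated tags vs. human-assigned tags
predictions = ["giá vàng, kinh tế, tài chính"]
references = ["giá vàng, kinh tế, thị trường"]
print(rouge.compute(predictions=predictions, references=references))
# -> {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}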

Training hyperparameters

The following hyperparameters were used during training:

num_train_epochs: 2
learning_rate: 1e-5
warmup_ratio: 0.05
weight_decay: 0.01
per_device_train_batch_size: 4
per_device_eval_batch_size: 4
group_by_length: True
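For reference, these settings map directly onto Hugging Face Seq2SeqTrainingArguments; this is only a sketch, since the card does not give the full Trainer setup, and output_dir is an assumption:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="vit5-base-tag-generation",  # assumed; not stated in the card
    num_train_epochs=2,
    learning_rate=1e-5,
    warmup_ratio=0.05,            # warm up over the first 5% of steps
    weight_decay=0.01,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    group_by_length=True,         # batch similar-length sequences to reduce padding
)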

We also evaluated the model on a dataset of 20K YouTube videos. For each video we extract the title and, where available, the tags, and feed the title to the model. For videos that already have tags, we compare the generated tags directly against the existing ones; otherwise, the generated tags are evaluated by humans. The results are available at: https://drive.google.com/drive/folders/1RvywNl41QYNa2lthp-O8hakVCMsfX456
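The card does not name the metric used for the direct comparison; assuming the tags come as comma-separated strings, one plausible choice is a set-overlap F1 (tag_f1 is a hypothetical helper, not part of the released code):

def tag_f1(predicted: str, reference: str) -> float:
    # Assumes comma-separated tag strings; compares lowercase tag sets
    pred = {t.strip().lower() for t in predicted.split(",") if t.strip()}
    ref = {t.strip().lower() for t in reference.split(",") if t.strip()}
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(tag_f1("giá vàng, kinh tế", "giá vàng, tài chính, kinh tế"))  # 0.8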

How to use the model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("banhabang/vit5-base-tag-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("banhabang/vit5-base-tag-generation")
model.to("cuda")

text = "..."  # the article title/content to tag

encoding = tokenizer(text, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")

outputs = model.generate(input_ids=input_ids, attention_mask=attention_masks, max_length=30, early_stopping=True)

for output in outputs:
    tags = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(tags)
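Several inputs can also be tagged in one call by padding the batch; a sketch reusing the tokenizer and model loaded above (the titles are made-up examples):

titles = ["Giá vàng hôm nay tăng mạnh", "Đội tuyển Việt Nam giành chiến thắng"]
batch = tokenizer(titles, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**batch, max_length=30, early_stopping=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))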

References

[1] T. V. Bui, O. T. Tran, and P. Le-Hong. 2020. Improving sequence tagging for Vietnamese text using transformer-based neural models. In Proceedings of PACLIC 2020. Code: https://github.com/fpt-corp/vELECTRA.

[2] Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042.

[3] Long Phan, Hieu Tran, Hieu Nguyen, and Trieu H. Trinh. 2022. ViT5: Pretrained text-to-text transformer for Vietnamese language generation. arXiv preprint arXiv:2205.06457. Code: https://github.com/vietai/ViT5.

[4] Cong Dao Tran, Nhut Huy Pham, Anh Nguyen, Truong Son Hy, and Tu Vu. 2023. ViDeBERTa: A powerful pre-trained language model for Vietnamese. In Findings of the Association for Computational Linguistics: EACL 2023.