# Finetuned XLM-RoBERTa BASE model on Thai sequence and token classification datasets

The script and documentation can be found at this repository.
## Model description

We use the pretrained cross-lingual RoBERTa model as proposed by [Conneau et al., 2020]. We download the pretrained PyTorch model via HuggingFace's Model Hub (https://huggingface.co/xlm-roberta-base).
## Intended uses & limitations

You can use the finetuned models for multiclass/multilabel text classification and token classification tasks; a minimal usage sketch follows each task list below.

### Multiclass text classification

- `wisesight_sentiment`: 4-class text classification task (`positive`, `neutral`, `negative`, and `question`) based on social media posts and tweets.
- `wongnai_reviews`: Users' review rating classification task (scale ranging from 1 to 5).
- `generated_reviews_enth`: (`review_star` as label) Generated users' review rating classification task (scale ranging from 1 to 5).
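
As a minimal sketch of how one of these sequence classification checkpoints might be queried through the `transformers` pipeline API — the model ID below is a placeholder, not the actual checkpoint name; substitute the real finetuned model:

```python
from transformers import pipeline

# "your-org/xlm-roberta-base-finetuned-wisesight" is a hypothetical model ID;
# replace it with the actual finetuned checkpoint for your task.
classifier = pipeline(
    "text-classification",
    model="your-org/xlm-roberta-base-finetuned-wisesight",
)

# Returns a list of {"label": ..., "score": ...} dicts; for wisesight_sentiment
# the label would be one of: positive, neutral, negative, question.
print(classifier("อาหารอร่อยมาก"))  # "The food is very delicious."
```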
### Multilabel text classification

- `prachathai67k`: Thai topic classification with 12 labels based on a news article corpus from prachathai.com. Details are described on this page.
### Token classification

- `thainer`: Named-entity recognition tagging with 13 named-entities as described on this page.
- `lst20`: NER and POS tagging. Named-entity recognition tagging with 10 named-entities and Part-of-Speech tagging with 16 tags as described on this page.
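
For the NER checkpoints, a similar pipeline sketch, again with a placeholder model ID; `aggregation_strategy="simple"` merges subword pieces back into word-level entity spans:

```python
from transformers import pipeline

# Placeholder model ID; substitute the actual finetuned thainer/lst20 checkpoint.
ner = pipeline(
    "token-classification",
    model="your-org/xlm-roberta-base-finetuned-thainer",
    aggregation_strategy="simple",  # merge subword tokens into entity spans
)

# Each result carries the entity group, score, and character offsets.
print(ner("นายกรัฐมนตรีเดินทางไปเชียงใหม่"))  # "The prime minister traveled to Chiang Mai."
```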
## How to use

The example notebook demonstrating how to use the finetuned models for inference can be found at this Colab notebook.
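
If you prefer to load the tokenizer and model directly rather than going through the notebook, here is a minimal sketch; the model ID is again a placeholder for the actual finetuned checkpoint:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "your-org/xlm-roberta-base-finetuned-wisesight"  # placeholder ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("ขอบคุณมากครับ", return_tensors="pt")  # "Thank you very much."
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index back to its label string.
print(model.config.id2label[logits.argmax(dim=-1).item()])
```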
## BibTeX entry and citation info

```bibtex
@misc{lowphansirikul2021wangchanberta,
    title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
    author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
    year={2021},
    eprint={2101.09635},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```