# bert-finetuned-sentiment-chinese
This model is a fine-tuned version of bert-base-chinese on 24,000 samples from the Douban Movies Short Comments dataset on Kaggle.
[Douban.com](https://en.wikipedia.org/wiki/Douban) (Chinese: 豆瓣; pinyin: Dòubàn), launched on 6 March 2005, is a Chinese social networking service website that allows registered users to record information and create content related to film, books, music, recent events, and activities in Chinese cities.
It achieves the following results on the evaluation set of 6000 samples:
- Loss: 0.4446
- F1: 0.5309
- Roc Auc: 0.7040
- Accuracy: 0.512
## Using the Hosted Inference API
Enter text in Chinese and wait for the assigned sentiment label. Example input: 连奥创都知道整容要去韩国 ("Even Ultron knows that you should go to Korea for plastic surgery"). A clearly positive input such as 我非常喜歡這個 ("I like this very much") is labeled Star_5.
## Using the model in code
```python
from transformers import pipeline

classifier = pipeline('sentiment-analysis', model='bert-finetuned-semantic-chinese/checkpoint-15000')
classifier('我非常喜歡這個')  # "I like this very much" -> Star_5
```
## Model description
Multi-label text classification based on the sentiment of the input text.
The following labels can be assigned to the input text: ['Star_1', 'Star_2', 'Star_3', 'Star_4', 'Star_5'].
- Star_1 - very negative
- Star_2 - negative
- Star_3 - neutral
- Star_4 - positive
- Star_5 - very positive
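Because the classification head is configured for multi-label classification (see the training procedure below), each label gets an independent sigmoid score rather than a softmax over the five stars. Below is a minimal sketch of mapping the scores back to star labels, reusing the checkpoint path from the code example above; the 0.5 threshold is an assumption, not something stated in this card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ['Star_1', 'Star_2', 'Star_3', 'Star_4', 'Star_5']
model_path = 'bert-finetuned-semantic-chinese/checkpoint-15000'  # checkpoint path from the example above

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

inputs = tokenizer('我非常喜歡這個', return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# Multi-label head: one independent sigmoid score per label; 0.5 threshold is assumed.
scores = torch.sigmoid(logits)[0]
predicted = [label for label, score in zip(labels, scores) if score >= 0.5]
print(predicted)  # e.g. ['Star_5'] for a strongly positive comment
```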
## Intended uses & limitations
Limitations: the model may reflect biases present in bert-base-chinese and in the Douban Movies Short Comments dataset from Kaggle.
## Training procedure
Trained with PyTorch; the main steps were (a minimal sketch follows the list):
- Splitting the dataframe into train and test sets
- One-hot encoding the star labels
- Setting the AutoTokenizer to bert-base-chinese
- Encoding the dataset
- Setting up AutoModelForSequenceClassification with problem type "multi_label_classification"
- Setting the training arguments
- Training with the Hugging Face Trainer
- Pushing to the Hub
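A condensed sketch of those preprocessing and setup steps. The CSV filename, the column names 'Comment' and 'Star', and the max_length value are assumptions for illustration; the 80/20 split is an assumption as well, chosen to match the 24,000/6,000 figures reported below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ['Star_1', 'Star_2', 'Star_3', 'Star_4', 'Star_5']

# Hypothetical filename and column names; the Kaggle CSV layout may differ.
df = pd.read_csv('DMSC.csv')
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

def one_hot(star):
    # One-hot encode the 1-5 star rating into five float labels.
    return [1.0 if star == i + 1 else 0.0 for i in range(5)]

tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')

def encode(batch):
    enc = tokenizer(batch['Comment'], truncation=True, padding='max_length', max_length=128)
    enc['labels'] = [one_hot(s) for s in batch['Star']]
    return enc

train_ds = Dataset.from_pandas(train_df).map(encode, batched=True)
test_ds = Dataset.from_pandas(test_df).map(encode, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-chinese',
    num_labels=len(labels),
    problem_type='multi_label_classification',
)
```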
## Training and evaluation data
24,000 samples for training and 6,000 samples for evaluation.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
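These values map onto TrainingArguments roughly as follows, continuing the sketch above. The output directory, evaluation strategy, and push_to_hub flag are assumptions (the results table and the procedure list suggest per-epoch evaluation and a Hub push); the Adam betas and epsilon listed above are the Transformers defaults, so they are not set explicitly.

```python
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir='bert-finetuned-semantic-chinese',  # assumed output directory
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type='linear',
    num_train_epochs=5,
    evaluation_strategy='epoch',  # assumed: the results table reports metrics once per epoch
    push_to_hub=True,             # assumed: the procedure list mentions pushing to the Hub
)

trainer = Trainer(
    model=model,                      # from the preprocessing sketch above
    args=args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # see the metrics sketch after the results table
)
trainer.train()
```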
### Training results
| Training Loss | Epoch | Step | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|---|
| 0.3683 | 1.0 | 3000 | 0.3569 | 0.4709 | 0.6613 | 0.3848 |
| 0.3284 | 2.0 | 6000 | 0.3677 | 0.5179 | 0.6931 | 0.478 |
| 0.2874 | 3.0 | 9000 | 0.4007 | 0.5209 | 0.6967 | 0.4943 |
| 0.2309 | 4.0 | 12000 | 0.4446 | 0.5309 | 0.7040 | 0.512 |
| 0.1828 | 5.0 | 15000 | 0.5096 | 0.5298 | 0.7040 | 0.515 |
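The F1, Roc Auc, and Accuracy columns are standard multi-label metrics. A sketch of a compute_metrics function that produces them is below; the 0.5 threshold and micro averaging are assumptions, since the card does not state how the metrics were thresholded or averaged.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    labels = labels.astype(int)  # one-hot labels may be stored as floats
    # Sigmoid per label, then threshold at 0.5 (assumed).
    probs = 1 / (1 + np.exp(-logits))
    preds = (probs >= 0.5).astype(int)
    return {
        'f1': f1_score(labels, preds, average='micro'),         # averaging is assumed
        'roc_auc': roc_auc_score(labels, probs, average='micro'),
        'accuracy': accuracy_score(labels, preds),               # exact-match (subset) accuracy
    }
```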
### Framework versions
- Transformers 4.21.1
- Pytorch 1.12.1+cu113
- Datasets 2.4.0
- Tokenizers 0.12.1