Model description
This model is a fine-tuned version of deepset's bert-base-german-cased model for classifying German-language user comments as engaging or not engaging.
How to use
You can use the model with the following code:

```python
#!pip install transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline

model_path = "ankekat1000/engaging-bert-german"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)

print(pipeline('Tolle Idee. Ich denke, dass dieses Projekt Teil des Stadtforums werden sollte, damit wir darüber weiter nachdenken können!'))
```
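The pipeline returns a list with one dict per input, each containing a label and a confidence score. The sketch below shows one way to turn that output into a boolean decision; note that the label strings (`LABEL_0`/`LABEL_1`) and the threshold are assumptions for illustration, as the exact label names depend on the model's config.

```python
# Hedged sketch: interpret TextClassificationPipeline output.
# 'LABEL_1' as the positive ("engaging") class is an assumption; check the
# model's id2label config to confirm the actual label names.

def is_engaging(prediction, positive_label="LABEL_1", threshold=0.5):
    """Return True if the top prediction is the positive class with
    a score at or above the threshold."""
    top = prediction[0]  # one dict per input text
    return top["label"] == positive_label and top["score"] >= threshold

# Hypothetical pipeline output for a single comment:
sample_output = [{"label": "LABEL_1", "score": 0.93}]
print(is_engaging(sample_output))  # True
```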
Training
The pre-trained bert-base-german-cased model by deepset was fine-tuned on a crowd-annotated data set of over 14,000 user comments labeled for engagingness in a binary classification task.
As engaging, we defined comments that enrich and add value to a deliberative discussion in whole or in part, such as comments that contribute arguments, suggestions, or new perspectives to the discussion, or that other users are likely to find stimulating or appreciative.
Language model: bert-base-german-cased (~ 12GB)
Language: German
Labels: engaging / not engaging (binary classification)
Training data: User comments posted to websites and Facebook pages of German news media, and user comments posted to online participation platforms (~ 14,000)
Labeling procedure: Crowd annotation
Batch size: 32
Epochs: 4
Max. tokens length: 512
Infrastructure: 1x RTX 6000
Published: Oct 24th, 2023
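From the hyperparameters above, the rough training length can be estimated; this is a small back-of-the-envelope sketch, assuming ~14,000 comments, a batch size of 32, and 4 epochs as listed:

```python
import math

# Approximate figures taken from the training details above
num_comments = 14_000
batch_size = 32
epochs = 4

# Optimizer steps per epoch and in total (ceiling, since the last
# batch may be smaller than batch_size)
steps_per_epoch = math.ceil(num_comments / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 438 1752
```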
Evaluation results
Accuracy: 86%
Macro avg. F1: 86%
| Label | Precision | Recall | F1 | Nr. comments in test set |
|---|---|---|---|---|
| not engaging | 0.87 | 0.84 | 0.86 | 701 |
| engaging | 0.84 | 0.87 | 0.85 | 667 |
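As a sanity check, the macro-averaged F1 reported above follows directly from the per-class F1 scores in the table (a minimal sketch using only the table's numbers):

```python
# Per-class F1 scores and test-set supports from the evaluation table
f1_scores = [0.86, 0.85]
supports = [701, 667]

# Macro average: unweighted mean over the two classes
macro_f1 = sum(f1_scores) / len(f1_scores)
print(round(macro_f1, 3))  # 0.855, i.e. ~86% after rounding

# Support-weighted average, for comparison (very close here,
# since the two classes are nearly balanced)
total = sum(supports)
weighted_f1 = sum(f * s / total for f, s in zip(f1_scores, supports))
print(round(weighted_f1, 3))
```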