
Target-Aware Counter-Speech Generation

<!-- Provide a quick summary of what the model is/does. -->

The target-aware counter-speech generation model is an autoregressive language model based on gpt2-medium and fine-tuned on hate- and counter-speech pairs from the CONAN datasets to generate more contextually relevant counter-speech. The model uses special tokens that embed target-demographic information to guide generation toward relevant responses and away from off-topic, generic ones. It is trained on eight target demographics: Migrants, People of Color (POC), LGBT+, Muslims, Women, Jews, Disabled, and Other.

Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

The model is intended for generating a counter-speech response to a given hate-speech sequence, combined with the special token that embeds the target demographic.
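
A minimal sketch of the expected prompt format, following the getting-started example below; the build_prompt helper is illustrative and not part of the released code:

# Illustrative helper showing the prompt format; build_prompt is not part
# of the released code. The demographic token precedes the hate speech,
# as in the getting-started example below.
TARGET_TOKENS = ["MIGRANTS", "POC", "LGBT+", "MUSLIMS", "WOMEN", "JEWS", "other", "DISABLED"]

def build_prompt(target: str, hate_speech: str) -> str:
    assert target in TARGET_TOKENS, f"unknown target demographic: {target}"
    return f"<|endoftext|> <{target}> Hate-speech: {hate_speech} Counter-speech: "

print(build_prompt("other", "Human are not created equal, some are born lesser."))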

Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

We observed negative effects such as content hallucination and toxic response generation. Although the intended use is to generate counter-speech for combating online hatred, usage should be monitored carefully, with human post-editing or an approval system, to ensure a safe and inclusive online environment.
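
One possible shape for such an approval step is to screen each generation with an off-the-shelf toxicity classifier and hold flagged outputs for human review. A minimal sketch, where the unitary/toxic-bert model and the 0.5 threshold are illustrative choices, not part of this model card:

from transformers import pipeline

# Screen generated counter-speech before publication; flagged outputs go
# to a human reviewer. Classifier and threshold are illustrative choices.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def needs_human_review(counter_speech: str, threshold: float = 0.5) -> bool:
    prediction = toxicity(counter_speech)[0]
    return prediction["label"] == "toxic" and prediction["score"] >= threshold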

How to Get Started with the Model

Use the code below to get started with the model.

types = ["MIGRANTS", "POC", "LGBT+", "MUSLIMS", "WOMEN", "JEWS", "other", "DISABLED"] # A list of all available target-demographic tokens
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(tum-nlp/gpt-2-medium-target-aware-counterspeech-generation)
tokenizer = AutoTokenizer.from_pretrained(tum-nlp/gpt-2-medium-target-aware-counterspeech-generation)
tokenizer.padding_side = "left" 

prompt = "<|endoftext|> <other> Hate-speech: Human are not created equal, some are born lesser. Counter-speech: "
input = tokenizer(prompt, return_tensors="pt", padding=True)
output_sequences = model.generate(
        input_ids=inputs['input_ids'].to(model.device),
        attention_mask=inputs['attention_mask'].to(model.device),
        pad_token_id=tokenizer.eos_token_id,
        max_length=128,
        num_beams=3,
        no_repeat_ngram_size=3,
        num_return_sequences=1,
        early_stopping=True
    )
  result = tokenizer.decode(output_sequences, skip_special_tokens=True)
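
Building on the snippet above, the same hate speech can be paired with each token in types to compare how the target demographic steers the generation (a usage sketch, not part of the original card):

# Generate one counter-speech response per target demographic for the same
# hate speech, to compare how the target token steers the output.
hate_speech = "Human are not created equal, some are born lesser."
for target in types:
    prompt = f"<|endoftext|> <{target}> Hate-speech: {hate_speech} Counter-speech: "
    inputs = tokenizer(prompt, return_tensors="pt", padding=True)
    output_sequences = model.generate(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
        pad_token_id=tokenizer.eos_token_id,
        max_length=128,
        num_beams=3,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )
    print(f"{target}: {tokenizer.decode(output_sequences[0], skip_special_tokens=True)}")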

Training Hyperparameters

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",  # required argument; the path here is an example
    num_train_epochs=20,
    learning_rate=3.800568576836524e-05,
    weight_decay=0.050977894796868116,
    warmup_ratio=0.10816909354342182,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=3,
    load_best_model_at_end=True,
    auto_find_batch_size=True,
)
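
For context, these arguments would plug into a standard Trainer run; a minimal sketch, assuming a tokenized dataset of hate-/counter-speech prompts with train and validation splits (the tokenized variable is a placeholder, not the released training code):

from transformers import Trainer, DataCollatorForLanguageModeling

# Causal-LM fine-tuning: mlm=False makes the collator reuse the input ids
# as labels. `tokenized` is a placeholder for the prepared CONAN pairs.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()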

Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

Testing Data, Factors & Metrics

Testing Data

<!-- This should link to a Data Card if possible. -->

The model's performance is tested on three test sets, of which two are subsets of the CONAN dataset and one is the sexist portion of the EDOS dataset.

Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

The model's performance is measured with a custom evaluation pipeline for counter-speech generation. The pipeline covers linguistic acceptability (CoLA), Toxicity (TOX), Hatefulness (Hate), Offensiveness (OFF), Label Similarity (L Sim), Context Similarity (C Sim), Validity as Counter-Speech (VaCS), Repetition Rate (RR), target-demographic F1, and the Arithmetic Mean (AM) of the individual scores.
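
For the target-aware gpt2-medium row of the CONAN table below, the reported AM matches the unweighted mean of the nine individual scores; this reading is inferred from the numbers rather than stated in the card:

# Unweighted mean of the individual metric scores for the target-aware
# gpt2-medium row of the CONAN table (inferred reading of AM).
scores = [0.958, 0.946, 1.000, 0.996, 0.706, 0.784, 0.946, 0.419, 0.880]
print(round(sum(scores) / len(scores), 3))  # 0.848, the reported AM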

Results

CONAN

| Model Name | CoLA | TOX | Hate | OFF | L Sim | C Sim | VaCS | RR | F1 | AM |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 0.937 | 0.955 | 1.000 | 0.997 | - | 0.751 | 0.980 | 0.861 | 0.885 | 0.929 |
| target-aware gpt2-medium | 0.958 | 0.946 | 1.000 | 0.996 | 0.706 | 0.784 | 0.946 | 0.419 | 0.880 | 0.848 |

CONAN SMALL

| Model Name | CoLA | TOX | Hate | OFF | L Sim | C Sim | VaCS | RR | F1 | AM |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 0.963 | 0.956 | 1.000 | 1.000 | 1.000 | 0.768 | 0.988 | 0.995 | 0.868 | 0.949 |
| target-aware gpt2-medium | 0.975 | 0.931 | 1.000 | 1.000 | 0.728 | 0.783 | 0.888 | 0.911 | 0.792 | 0.890 |

EDOS

| Model Name | CoLA | TOX | Hate | OFF | C Sim | VaCS | RR | F1 | AM |
|---|---|---|---|---|---|---|---|---|---|
| target-aware gpt2-medium | 0.930 | 0.815 | 0.999 | 0.975 | 0.689 | 0.857 | 0.518 | 0.747 | 0.816 |