Model description

An xlm-roberta-large model fine-tuned on ~1.6 million annotated statements from the Manifesto Corpus (version 2023a). The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme (Handbook 4). It works for all languages that the xlm-roberta model was pretrained on (overview), but it will perform best for the 38 languages contained in the Manifesto Corpus:

armenian bosnian bulgarian catalan croatian
czech danish dutch english estonian
finnish french galician georgian german
greek hebrew hungarian icelandic italian
japanese korean latvian lithuanian macedonian
montenegrin norwegian polish portuguese romanian
russian serbian slovak slovenian spanish
swedish turkish ukrainian

How to use

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2023-1-1")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "We will restore funding to the Global Environment Facility and the Intergovernmental Panel on Climate Change, to support critical climate science research around the world"

inputs = tokenizer(sentence,
                   return_tensors="pt",
                   max_length=200,  # inputs were limited to 200 tokens during fine-tuning
                   padding="max_length",
                   truncation=True
                   )

logits = model(**inputs).logits

probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'501 - Environmental Protection: Positive': 67.28, '411 - Technology and Infrastructure': 15.19, '107 - Internationalism: Positive': 13.63, '416 - Anti-Growth Economy: Positive': 2.02...

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 501 - Environmental Protection: Positive
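When only the few most likely topics are of interest, the softmax-and-sort step above can be wrapped in a small helper. The following is a minimal sketch in plain Python, not part of the model's API: the function name `top_k_predictions` and the toy labels are hypothetical and chosen for illustration only.

```python
import math

def top_k_predictions(logits, id2label, k=3):
    """Return the k most probable labels with their probabilities in percent.

    `logits` is a flat list of floats, one per class; `id2label` maps the
    integer class index to a label string. (Hypothetical helper.)
    """
    # Numerically stable softmax: subtract the largest logit before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank class indices from most to least probable and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], round(probs[i] * 100, 2)) for i in ranked[:k]]

# Toy example with three made-up labels instead of the 56 real categories.
toy_id2label = {0: "A", 1: "B", 2: "C"}
toy_logits = [2.0, 0.5, 1.0]
print(top_k_predictions(toy_logits, toy_id2label, k=2))
```

With the model loaded as above, the same helper could presumably be called as `top_k_predictions(logits[0].tolist(), model.config.id2label, k=3)`.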

Model Performance

The model was evaluated on a test set of 199,046 annotated manifesto statements.

Overall

Model           Accuracy  Top2_Acc  Top3_Acc  Precision  Recall  F1_Macro  MCC   Cross-Entropy
Sentence Model  0.57      0.73      0.81      0.49       0.43    0.45      0.55  1.5
Context Model   0.64      0.81      0.88      0.54       0.52    0.53      0.62  1.15
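The Top2_Acc and Top3_Acc columns are not defined here; a common reading, assumed in this sketch, is top-k accuracy: a prediction counts as correct if the true category is among the k most probable ones. The function name `top_k_accuracy` and the toy data below are illustrative and are not taken from the actual evaluation code.

```python
def top_k_accuracy(prob_rows, true_ids, k):
    """Share of examples whose true class is among the k most probable classes.

    `prob_rows` holds one list of class probabilities per example and
    `true_ids` the matching true class indices. (Hypothetical helper.)
    """
    hits = 0
    for probs, true_id in zip(prob_rows, true_ids):
        # Indices of the k highest-probability classes for this example.
        top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        if true_id in top_k:
            hits += 1
    return hits / len(true_ids)

# Toy data: four examples, three classes (made up for illustration).
toy_probs = [
    [0.7, 0.2, 0.1],  # true class 0 is ranked first
    [0.1, 0.5, 0.4],  # true class 2 is ranked second
    [0.2, 0.3, 0.5],  # true class 0 is ranked third
    [0.6, 0.3, 0.1],  # true class 0 is ranked first
]
toy_true = [0, 2, 0, 0]

for k in (1, 2, 3):
    print(k, top_k_accuracy(toy_probs, toy_true, k))
```

Under this reading, top-1 accuracy coincides with the plain Accuracy column, and the value can only stay equal or increase as k grows.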