beto-emoji
Fine-tuning BETO (Spanish BERT) for emoji prediction.
Repository
Training details and a usage example are available at github.com/camilocarvajalreyes/beto-emoji. A deeper analysis of this and other models on the full dataset can be found at github.com/furrutiav/data-mining-2022. We used this model in a project for the CC5205 Data Mining course.
Example
Inspired by the model card of cardiffnlp/twitter-roberta-base-emoji.
```python
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

MODEL = "ccarvajal/beto-emoji"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Download the index-to-emoji label mapping
labels = []
mapping_link = "https://raw.githubusercontent.com/camilocarvajalreyes/beto-emoji/main/es_mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)  # optional: cache the model locally

text = preprocess("que viva españa")
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# Print the labels ranked by predicted probability
ranking = np.argsort(scores)[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
```
Output

```text
1) 🇪🇸 0.2508
2) π 0.238
3) π 0.2225
4) π 0.0806
5) ❤ 0.0489
6) π 0.0415
7) π 0.0232
8) π 0.0229
9) π 0.0156
10) π 0.0119
11) π 0.0079
12) π 0.0077
13) 💪 0.0066
14) π 0.0054
15) π 0.0052
16) π 0.005
17) π 0.0034
18) 🎶 0.0022
19) ✨ 0.0007
```
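The ranking step at the end of the script can be sanity-checked in isolation. Below is a minimal sketch with made-up logits (no model download needed); the NumPy-only softmax here is equivalent to `scipy.special.softmax`, and the three labels are just a toy subset of the real mapping:

```python
import numpy as np

# Made-up logits standing in for the model output (hypothetical values)
logits = np.array([2.0, 0.5, 1.0])
labels = ["🇪🇸", "😍", "😂"]  # toy subset; real labels come from es_mapping.txt

# Numerically stable softmax (equivalent to scipy.special.softmax)
exp = np.exp(logits - logits.max())
scores = exp / exp.sum()

# Sort indices by descending probability, as in the example above
ranking = np.argsort(scores)[::-1]
ranked = [(labels[i], round(float(scores[i]), 4)) for i in ranking]
print(ranked)
```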
Results in test set

```text
              precision    recall  f1-score   support

           ❤       0.39      0.43      0.41      2141
           😍       0.29      0.39      0.33      1408
           😂       0.51      0.51      0.51      1499
           💕       0.09      0.05      0.06       352
           😊       0.12      0.23      0.16       514
           😘       0.24      0.23      0.24       397
           💪       0.37      0.43      0.40       307
           😉       0.15      0.17      0.16       453
           👌       0.09      0.16      0.11       180
          🇪🇸       0.46      0.46      0.46       424
           😎       0.12      0.11      0.11       339
           💙       0.36      0.02      0.04       413
           💜       0.00      0.00      0.00       235
           😜       0.04      0.02      0.02       274
           💞       0.00      0.00      0.00        93
           ✨       0.26      0.12      0.17       416
           🎶       0.25      0.24      0.24       212
           💘       0.00      0.00      0.00       134
           😁       0.05      0.03      0.04       209

    accuracy                           0.30     10000
   macro avg       0.20      0.19      0.18     10000
weighted avg       0.29      0.30      0.29     10000
```
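The table above follows the usual per-class classification-report layout: precision, recall, F1-score and support for each emoji, plus overall accuracy and averages. As a reminder of how each cell is computed, here is a small pure-Python sketch on made-up gold labels and predictions (not the actual test data):

```python
# Hypothetical gold labels and predictions for three of the classes
gold = ["❤", "😂", "❤", "🇪🇸", "😂", "❤"]
pred = ["❤", "❤", "❤", "🇪🇸", "😂", "😂"]

def per_class_metrics(label):
    """Precision, recall, F1 and support for one class, as in the table."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    precision = tp / max(1, sum(p == label for p in pred))
    recall = tp / max(1, sum(g == label for g in gold))
    f1 = 2 * precision * recall / max(1e-12, precision + recall)
    support = sum(g == label for g in gold)
    return precision, recall, f1, support

print(per_class_metrics("❤"))
```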
Another example, with a visualisation of this model's attention modules, is carried out using bertviz.
Reproducibility
The Multilingual Emoji Prediction dataset (Barbieri et al., 2018) consists of tweets in English and Spanish that originally contained a single emoji, which is then used as the label. The goal is to predict that emoji from the tweet's text alone, out of a fixed set of candidates: 20 emojis for English and 19 for Spanish. Test and trial sets can be downloaded here, but the train set needs to be retrieved with a Twitter crawler.
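Concretely, each tweet yields a (text, label) pair by removing its emoji and looking up that emoji's index in the mapping. A toy sketch of this step, using an abbreviated stand-in for the real 19-row es_mapping.txt (the indices shown are illustrative):

```python
# Abbreviated stand-in for es_mapping.txt (index -> emoji); the real file has 19 rows
mapping = {0: "❤", 1: "😍", 9: "🇪🇸"}
emoji_to_id = {e: i for i, e in mapping.items()}

def to_example(tweet):
    """Turn a raw tweet into a (text, label_id) pair by stripping its emoji."""
    for emoji, label_id in emoji_to_id.items():
        if emoji in tweet:
            return tweet.replace(emoji, "").strip(), label_id
    return None  # tweet contains no emoji from the fixed set

print(to_example("que viva españa 🇪🇸"))
```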
Training parameters:
```python
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01
)
```
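These arguments plug into Hugging Face's Trainer. Below is a hedged wiring sketch, not the authors' exact script: the BETO checkpoint dccuchile/bert-base-spanish-wwm-cased, num_labels=19 and the train_dataset/eval_dataset placeholders are assumptions, and dataset loading and tokenisation are elided:

```python
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Assumed BETO checkpoint; 19 labels for the Spanish emoji set
model = AutoModelForSequenceClassification.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased", num_labels=19)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenised splits, prepared elsewhere
    eval_dataset=eval_dataset,
)
trainer.train()
```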