beto-emoji

Fine-tuning BETO for emoji prediction.

Repository

Training details and a usage example are shown at github.com/camilocarvajalreyes/beto-emoji. A deeper analysis of this and other models on the full dataset can be found at github.com/furrutiav/data-mining-2022. We used this model in a project for the CC5205 Data Mining course.

Example

Inspired by the model card of cardiffnlp/twitter-roberta-base-emoji.

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

MODEL = "ccarvajal/beto-emoji"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# download label mapping
mapping_link = "https://raw.githubusercontent.com/camilocarvajalreyes/beto-emoji/main/es_mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)  # optional: keep a local copy of the weights
text = "que viva españa"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

Output

1) πŸ‡ͺπŸ‡Έ 0.2508
2) 😍 0.238
3) πŸ‘Œ 0.2225
4) πŸ˜‚ 0.0806
5) ❀ 0.0489
6) 😁 0.0415
7) 😜 0.0232
8) 😎 0.0229
9) 😊 0.0156
10) πŸ˜‰ 0.0119
11) πŸ’œ 0.0079
12) πŸ’• 0.0077
13) πŸ’ͺ 0.0066
14) πŸ’˜ 0.0054
15) πŸ’™ 0.0052
16) πŸ’ž 0.005
17) 😘 0.0034
18) 🎢 0.0022
19) ✨ 0.0007

Results in test set

             precision    recall  f1-score   support

       ❀       0.39      0.43      0.41      2141
       😍       0.29      0.39      0.33      1408
       πŸ˜‚       0.51      0.51      0.51      1499
       πŸ’•       0.09      0.05      0.06       352
       😊       0.12      0.23      0.16       514
       😘       0.24      0.23      0.24       397
       πŸ’ͺ       0.37      0.43      0.40       307
       πŸ˜‰       0.15      0.17      0.16       453
       πŸ‘Œ       0.09      0.16      0.11       180
       πŸ‡ͺπŸ‡Έ       0.46      0.46      0.46       424
       😎       0.12      0.11      0.11       339
       πŸ’™       0.36      0.02      0.04       413
       πŸ’œ       0.00      0.00      0.00       235
       😜       0.04      0.02      0.02       274
       πŸ’ž       0.00      0.00      0.00        93
       ✨       0.26      0.12      0.17       416
       🎢       0.25      0.24      0.24       212
       πŸ’˜       0.00      0.00      0.00       134
       😁       0.05      0.03      0.04       209

 accuracy                           0.30     10000
macro avg       0.20      0.19      0.18     10000
weighted avg    0.29      0.30      0.29     10000

Another example, including a visualisation of this model's attention modules, is carried out using bertviz.

Reproducibility

The Multilingual Emoji Prediction dataset (Barbieri et al., 2018) consists of tweets in English and Spanish that originally contained a single emoji, which is then used as the label. The test and trial sets can be downloaded here, but the train set needs to be retrieved with a Twitter crawler. The goal is to predict the emoji that originally appeared in the tweet from its text alone, out of a fixed set of possible emojis (20 for English and 19 for Spanish).
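The per-class numbers in the table above follow the usual precision/recall/F1 definitions, with "macro avg" being the unweighted mean over classes. As an illustration only (not the official task scorer), a minimal pure-Python computation on toy labels:

```python
def per_class_f1(y_true, y_pred):
    """Per-class precision, recall and F1, plus the macro-averaged F1."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == lab and t != lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(labels)
    return scores, macro_f1

# Toy example: one 😍 tweet misclassified as ❤
scores, macro = per_class_f1(["❤", "😍", "❤", "😂"], ["❤", "❤", "❤", "😂"])
# ❤ gets precision 2/3, recall 1.0, F1 0.8; macro F1 = (0.8 + 0.0 + 1.0) / 3 = 0.6
```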

Training parameters:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01
)
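These arguments plug into a standard Hugging Face `Trainer`. The following is only a sketch of how the fine-tuning loop could be wired up: the base checkpoint `dccuchile/bert-base-spanish-wwm-cased` is the commonly used BETO model (an assumption, not confirmed by this card), and `train_ds` / `eval_ds` are placeholder names for tokenized datasets, not variables from the repository.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed BETO base checkpoint; 19 labels for the Spanish emoji set
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased", num_labels=19)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

# train_ds / eval_ds: placeholder datasets with "input_ids", "attention_mask", "label"
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
)
trainer.train()
```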