beto-emoji

Fine-tuning BETO for emoji prediction.

Repository

Training details and a usage example are shown at github.com/camilocarvajalreyes/beto-emoji. A deeper analysis of this and other models on the full dataset can be found at github.com/furrutiav/data-mining-2022. We used this model in a project for the CC5205 Data Mining course.

Example

Inspired by the model card of cardiffnlp/twitter-roberta-base-emoji.

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

MODEL = "ccarvajal/beto-emoji"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# download label mapping
mapping_link = "https://raw.githubusercontent.com/camilocarvajalreyes/beto-emoji/main/es_mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)  # optional: keep a local copy of the weights
text = "que viva españa"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

Output

1) πŸ‡ͺπŸ‡Έ 0.2508
2) 😍 0.238
3) πŸ‘Œ 0.2225
4) πŸ˜‚ 0.0806
5) ❀ 0.0489
6) 😁 0.0415
7) 😜 0.0232
8) 😎 0.0229
9) 😊 0.0156
10) πŸ˜‰ 0.0119
11) πŸ’œ 0.0079
12) πŸ’• 0.0077
13) πŸ’ͺ 0.0066
14) πŸ’˜ 0.0054
15) πŸ’™ 0.0052
16) πŸ’ž 0.005
17) 😘 0.0034
18) 🎢 0.0022
19) ✨ 0.0007

Results in test set

             precision    recall  f1-score   support

       ❀       0.39      0.43      0.41      2141
       😍       0.29      0.39      0.33      1408
       πŸ˜‚       0.51      0.51      0.51      1499
       πŸ’•       0.09      0.05      0.06       352
       😊       0.12      0.23      0.16       514
       😘       0.24      0.23      0.24       397
       πŸ’ͺ       0.37      0.43      0.40       307
       πŸ˜‰       0.15      0.17      0.16       453
       πŸ‘Œ       0.09      0.16      0.11       180
       πŸ‡ͺπŸ‡Έ       0.46      0.46      0.46       424
       😎       0.12      0.11      0.11       339
       πŸ’™       0.36      0.02      0.04       413
       πŸ’œ       0.00      0.00      0.00       235
       😜       0.04      0.02      0.02       274
       πŸ’ž       0.00      0.00      0.00        93
       ✨       0.26      0.12      0.17       416
       🎢       0.25      0.24      0.24       212
       πŸ’˜       0.00      0.00      0.00       134
       😁       0.05      0.03      0.04       209

 accuracy                           0.30     10000
macro avg       0.20      0.19      0.18     10000
weighted avg    0.29      0.30      0.29     10000

Another example, including a visualisation of this model's attention modules, is carried out using bertviz.

Reproducibility

The Multilingual Emoji Prediction dataset (Barbieri et al., 2018) consists of tweets in English and Spanish that originally contained a single emoji, which is then used as the label. The test and trial sets can be downloaded here, but the train set needs to be retrieved with a Twitter crawler. The goal is to predict the emoji that originally appeared in the tweet from its text alone, out of a fixed set of possible emojis (20 for English and 19 for Spanish).
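The per-class numbers in the table above follow the usual precision/recall/F1 definitions, with "macro avg" being the unweighted mean over classes. As an illustration only (not the official task scorer), a minimal pure-Python computation on toy labels:

```python
def per_class_f1(y_true, y_pred):
    """Per-class precision, recall and F1, plus the macro-averaged F1."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == lab and t != lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(labels)
    return scores, macro_f1

# Toy example: one 😍 tweet misclassified as ❤
scores, macro = per_class_f1(["❤", "😍", "❤", "😂"], ["❤", "❤", "❤", "😂"])
# ❤ gets precision 2/3, recall 1.0, F1 0.8; macro F1 = (0.8 + 0.0 + 1.0) / 3 = 0.6
```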

Training parameters:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01
)
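These arguments plug into a standard Hugging Face `Trainer`. The following is only a sketch of how the fine-tuning loop could be wired up: the base checkpoint `dccuchile/bert-base-spanish-wwm-cased` is the commonly used BETO model (an assumption, not confirmed by this card), and `train_ds` / `eval_ds` are placeholder names for tokenized datasets, not variables from the repository.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed BETO base checkpoint; 19 labels for the Spanish emoji set
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased", num_labels=19)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

# train_ds / eval_ds: placeholder datasets with "input_ids", "attention_mask", "label"
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
)
trainer.train()
```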