
A BERT large-cased model pretrained on Romanian tweets posted between 2008 and 2022.

How to use

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("Iulian277/ro-bert-tweet-large")
model = AutoModel.from_pretrained("Iulian277/ro-bert-tweet-large")

# Sanitize the input with the `normalize` function from the repository's
# `normalize.py` script (it requires the `emoji` package: pip install emoji)
from normalize import normalize
normalized_text = normalize("Salut, ce faci?")

# Tokenize the sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode(normalized_text, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)

# Get the encoding: the last hidden states are the first element of the output
last_hidden_states = outputs[0]
```
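To collapse the per-token `last_hidden_states` into a single sentence embedding, a common approach (not prescribed by this model card) is attention-mask-weighted mean pooling. The sketch below uses dummy tensors in place of real model output, so the shapes are illustrative assumptions:

```python
import torch

def mean_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then average over the sequence dimension.
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_states * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
    return summed / counts

# Dummy tensors standing in for real model output: batch 1, seq len 4, hidden 8
hidden = torch.randn(1, 4, 8)
attn = torch.tensor([[1, 1, 1, 0]])  # last position is padding
embedding = mean_pool(hidden, attn)  # shape: (1, 8)
```

With real outputs, `hidden` would be `last_hidden_states` and `attn` the `attention_mask` returned by the tokenizer.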

Always use the `normalize.py` script included in the repository to sanitize your input text before feeding it to the tokenizer. Otherwise, performance will degrade because unnormalized text produces [UNK] tokens.
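The repository's `normalize.py` is not reproduced here, but tweet normalization for pretrained models typically replaces user mentions and URLs with placeholder tokens and collapses whitespace. The following is a hypothetical sketch of that idea only; the actual script (and its placeholder tokens) may differ:

```python
import re

def normalize_tweet(text: str) -> str:
    # Hypothetical sketch; the repository's normalize.py may behave differently.
    text = re.sub(r"@\w+", "@USER", text)          # mask user mentions
    text = re.sub(r"https?://\S+", "HTTPURL", text)  # mask URLs
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    return text

print(normalize_tweet("Salut @ion, vezi https://example.com  acum!"))
# → Salut @USER, vezi HTTPURL acum!
```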

Acknowledgements

We would like to thank the TPU Research Cloud program for providing the TPU compute needed to pretrain the RoBERTweet models.