<a name="introduction"></a> BERTabaporu: a genre-specific pre-trained model of Portuguese-speaking social media
Introduction
BERTabaporu is a Brazilian Portuguese BERT model for the Twitter domain. The model was built from a collection of 238 million tweets written by over 100 thousand unique Twitter users, comprising over 2.9 billion tokens in total.
Available models
Model | Arch. | #Layers | #Params |
---|---|---|---|
pablocosta/bertabaporu-base-uncased | BERT-Base | 12 | 110M |
pablocosta/bertabaporu-large-uncased | BERT-Large | 24 | 335M |
Usage
from transformers import AutoTokenizer  # or BertTokenizer
from transformers import AutoModelForPreTraining  # or BertForPreTraining, to load the pretraining heads
from transformers import AutoModel  # or BertModel, for BERT without the pretraining heads
# Load the base model with its pretraining heads, plus the matching tokenizer
model = AutoModelForPreTraining.from_pretrained('pablocosta/bertabaporu-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
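For downstream tasks, one common way to use a BERT encoder is to turn each tweet into a fixed-size vector. The following is a minimal sketch of that idea, not a procedure prescribed by the authors: it loads the encoder without pretraining heads via `AutoModel` and averages the token vectors while ignoring padding. The helper names `mean_pool` and `embed` are hypothetical, mean pooling is an assumed choice of pooling strategy, and PyTorch is assumed to be installed.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "pablocosta/bertabaporu-base-uncased"


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors of each sequence, skipping padding positions.

    Hypothetical helper: mean pooling is an illustrative choice, not the
    authors' documented method.
    """
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for sequences that are entirely masked
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed(texts, model_name: str = MODEL_NAME) -> torch.Tensor:
    """Return one vector per input text (hypothetical helper)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # encoder only, no pretraining heads
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return mean_pool(output.last_hidden_state, batch["attention_mask"])
```

Calling `embed(["bom dia, pessoal!", "que jogo foi esse ontem"])` downloads the checkpoint on first use and returns a tensor with one row per input text. Swapping `MODEL_NAME` for `pablocosta/bertabaporu-large-uncased` would use the larger model in the same way.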
Cite us
@inproceedings{bertabaporu,
  author={Pablo Botton da Costa and Matheus Camasmie Pavan and Wesley Ramos dos Santos and Samuel Caetano da Silva and Ivandr\'e Paraboni},
  title={{BERTabaporu: assessing a genre-specific language model for Portuguese NLP}},
  booktitle={Recent Advances in Natural Language Processing ({RANLP-2023})},
  year={2023},
  address={Varna, Bulgaria}
}