# GreekSocialBERT

A Greek pre-trained language model based on GreekBERT. This model is an updated version of `greeksocialbert-base-greek-uncased-v1`.
## Pre-training data

The pre-trained GreekBERT model is further trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook, and YouTube), using GreekBERT's tokenizer. The corpus was provided by Palo LTD.
## Requirements

```
pip install transformers
pip install torch
```
## Pre-processing details

In order to use this model, the text needs to be pre-processed as follows:

- remove all Greek diacritics
- convert to lowercase
```python
import unicodedata

def preprocess(text):
    # lowercase, then strip acute accents from the decomposed (NFD) form
    text = text.lower()
    text = unicodedata.normalize('NFD', text).translate(
        {ord('\N{COMBINING ACUTE ACCENT}'): None}
    )
    return text
```
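As a quick sanity check of the pre-processing step, applying the function to an accented, mixed-case phrase (the example input below is illustrative, not from the training corpus) yields the lowercase, accent-free form the model expects:

```python
import unicodedata

def preprocess(text):
    # lowercase, then strip acute accents from the decomposed (NFD) form
    text = text.lower()
    return unicodedata.normalize('NFD', text).translate(
        {ord('\N{COMBINING ACUTE ACCENT}'): None}
    )

# 'Μέσα κοινωνικής δικτύωσης' ("social networking media")
print(preprocess('Μέσα κοινωνικής δικτύωσης'))  # → μεσα κοινωνικης δικτυωσης
```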
## Load Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("pchatz/greeksocialbert-base-greek-social-media-v2")
model = AutoModelForMaskedLM.from_pretrained("pchatz/greeksocialbert-base-greek-social-media-v2")
```
You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline

fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# 'μεσα [MASK] δικτυωσης' ≈ '[MASK] networking media' (pre-processed input)
fill(f'μεσα {fill.tokenizer.mask_token} δικτυωσης')
```
## Evaluation on MLM and Sentiment Analysis tasks

For detailed results, refer to the thesis 'Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών' (Sentiment Analysis of Greek Text Using Transformer Networks).
## Authors

Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos
## BibTeX entry and citation info

Thesis: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623
```bibtex
@Article{info12080331,
  AUTHOR = {Alexandridis, Georgios and Varlamis, Iraklis and Korovesis, Konstantinos and Caridakis, George and Tsantilas, Panagiotis},
  TITLE = {A Survey on Sentiment Analysis and Opinion Mining in Greek Social Media},
  JOURNAL = {Information},
  VOLUME = {12},
  YEAR = {2021},
  NUMBER = {8},
  ARTICLE-NUMBER = {331},
  URL = {https://www.mdpi.com/2078-2489/12/8/331},
  ISSN = {2078-2489},
  DOI = {10.3390/info12080331}
}
```