GreekSocialBERT

A Greek pre-trained language model based on GreekBERT. This model is an updated version of greeksocialbert-base-greek-uncased-v1.

Pre-training data

The pre-trained GreekBERT model is further trained on a corpus of 458,293 documents collected from Greek social media (Twitter, Instagram, Facebook, and YouTube), using GreekBERT's tokenizer.

The corpus has been provided by Palo LTD.

Requirements

pip install transformers
pip install torch

Pre-processing details

In order to use this model, input text must first be pre-processed as follows (lowercased and stripped of accents, matching the model's uncased vocabulary):

import unicodedata

def preprocess(text):
    # Lowercase, then strip acute accents: NFD normalization splits each
    # accented character into a base letter plus a combining mark, and
    # translate() removes the combining acute accent.
    text = text.lower()
    text = unicodedata.normalize('NFD', text).translate(
        {ord('\N{COMBINING ACUTE ACCENT}'): None})
    return text
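A quick check of the function above (the example phrase is illustrative):

```python
import unicodedata

def preprocess(text):
    # Lowercase and strip acute accents via NFD decomposition.
    text = text.lower()
    return unicodedata.normalize('NFD', text).translate(
        {ord('\N{COMBINING ACUTE ACCENT}'): None})

print(preprocess('Μέσα κοινωνικής δικτύωσης'))  # → μεσα κοινωνικης δικτυωσης
```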

Load Model

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("pchatz/greeksocialbert-base-greek-social-media-v2")

model = AutoModelForMaskedLM.from_pretrained("pchatz/greeksocialbert-base-greek-social-media-v2")

You can use this model directly with a pipeline for masked language modeling:

from transformers import pipeline

fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill(f'μεσα {fill.tokenizer.mask_token} δικτυωσης')
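One pitfall when combining the pipeline with the pre-processing step: insert the mask token after calling preprocess, since lowercasing a string that already contains it would turn '[MASK]' into '[mask]', which the tokenizer no longer recognizes. A minimal sketch of building the query, assuming the mask token is '[MASK]' (in practice, read it from fill.tokenizer.mask_token):

```python
import unicodedata

def preprocess(text):
    # Lowercase and strip acute accents via NFD decomposition.
    text = text.lower()
    return unicodedata.normalize('NFD', text).translate(
        {ord('\N{COMBINING ACUTE ACCENT}'): None})

# Preprocess the raw text first, then splice in the mask token,
# so the token itself is not lowercased or altered.
mask = '[MASK]'  # assumption: equals fill.tokenizer.mask_token
query = preprocess('μέσα κοινωνικής δικτύωσης').replace('κοινωνικης', mask)
print(query)  # → μεσα [MASK] δικτυωσης
```

The resulting string can then be passed to `fill(query)` as shown above.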

Evaluation on MLM and Sentiment Analysis tasks

For detailed results, refer to the thesis 'Ανάλυση συναισθήματος κειμένου στα Ελληνικά με χρήση Δικτύων Μετασχηματιστών' (Sentiment Analysis of Greek Text Using Transformer Networks).

Authors

Pavlina Chatziantoniou, Georgios Alexandridis and Athanasios Voulodimos

BibTeX entry and citation info

http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18623


@Article{info12080331,
AUTHOR = {Alexandridis, Georgios and Varlamis, Iraklis and Korovesis, Konstantinos and Caridakis, George and Tsantilas, Panagiotis},
TITLE = {A Survey on Sentiment Analysis and Opinion Mining in Greek Social Media},
JOURNAL = {Information},
VOLUME = {12},
YEAR = {2021},
NUMBER = {8},
ARTICLE-NUMBER = {331},
URL = {https://www.mdpi.com/2078-2489/12/8/331},
ISSN = {2078-2489},
DOI = {10.3390/info12080331}
}