CypriotBERT

<p align="center"> A Cypriot version of the BERT pre-trained language model. <img src="https://github.com/pedroandreou/Cypriot-LLM/raw/main/cypriot-bert-logo.png" width="300"/> </p>

Pre-training corpora

The bert-base-cypriot-uncased-v1 pre-training corpus consists of 133 documents sourced from Cypriot TV scripts and writings by Cypriot authors (7 MB, or about 0.007 GB, of data in total).

Pre-training details

We trained BERT using our own established framework.
We released a model similar to the English bert-base-uncased model but with 6 layers instead of 12 (6-layer, 768-hidden, 12-heads).
Total trainable params: 67M
We chose to follow the default parameter values (rather than the original training set-up of 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4), as we placed more emphasis on establishing a framework that will help us train better models in the future.
We were able to use a Tesla V100-SXM2-32GB and trained our model for 4 hours. Huge thanks to both MantisNLP and The Cyprus Institute for supporting me!
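
As a rough sanity check on the 67M figure, the parameter count of a 6-layer, 768-hidden BERT encoder can be worked out by hand. The sketch below assumes a vocabulary of 30,522 tokens, 512 positions and a 3,072-wide feed-forward layer (the English bert-base-uncased values). The actual vocabulary size of bert-base-cypriot-uncased-v1 is not stated here, so treat the result as an approximation. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(vocab_size=30522, hidden=768, layers=6,
                     intermediate=3072, max_positions=512, type_vocab=2):
    """Count parameters of a BERT encoder (embeddings + layers + pooler).

    Every default except `layers` and `hidden` is an ASSUMPTION borrowed
    from the English bert-base-uncased configuration.
    """
    def dense(n_in, n_out):
        # Weight matrix plus bias vector of a linear layer.
        return n_in * n_out + n_out

    layer_norm = 2 * hidden  # scale + shift

    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value and output projections, then LayerNorm.
    attention = 4 * dense(hidden, hidden) + layer_norm
    # Feed-forward: expand to `intermediate`, project back, then LayerNorm.
    feed_forward = (dense(hidden, intermediate)
                    + dense(intermediate, hidden)
                    + layer_norm)
    encoder = layers * (attention + feed_forward)

    pooler = dense(hidden, hidden)
    return embeddings + encoder + pooler


total = bert_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count lands close to the reported 67M. A different vocabulary size would shift only the embedding term.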

Requirements

We published bert-base-cypriot-uncased-v1 as part of Hugging Face's Transformers repository, so you need to install the transformers library through pip, along with PyTorch.

pip install "transformers[torch]"

Pre-process text (Deaccent - Lower)

NOTICE: Preprocessing is now natively supported by the default tokenizer. No need to include the following code.

bert-base-cypriot-uncased-v1 expects lowercase text with all diacritics removed. If you ever need to apply this step manually (for example, outside the tokenizer), it looks like this:

import unicodedata

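def strip_accents_and_lowercase(s):
    # NFD splits each accented letter into a base letter plus combining marks;
    # dropping category 'Mn' (nonspacing marks) removes the diacritics.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()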

accented_string = "Τούτη εν η Κυπριακή έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)

print(unaccented_string) # τουτη εν η κυπριακη εκδοση του bert.

Load Pretrained Model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("petros/bert-base-cypriot-uncased-v1")
model = AutoModel.from_pretrained("petros/bert-base-cypriot-uncased-v1")

Use Pretrained Model as a Language Model

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer_cypriot = AutoTokenizer.from_pretrained('petros/bert-base-cypriot-uncased-v1')
lm_model_cypriot = AutoModelForMaskedLM.from_pretrained('petros/bert-base-cypriot-uncased-v1')

# ================ EXAMPLE 1 ================
text_1 = 'Τι [MASK] ρε'
input_ids = tokenizer_cypriot.encode(text_1)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids)) # ['[CLS]', 'τι', '[MASK]', 'ρε', '[SEP]']

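outputs = lm_model_cypriot(torch.tensor([input_ids]))[0]  # logits: (batch, sequence, vocab)
# Position 2 is the [MASK] token; take the highest-scoring vocabulary id there
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, 2].max(0)[1].item())) # ειδους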

# ================ EXAMPLE 2 ================
text_2 = 'Eίσαι μια [MASK].'
input_ids = tokenizer_cypriot.encode(text_2)
print(tokenizer_cypriot.convert_ids_to_tokens(input_ids)) # ['[CLS]', 'eισ', '##αι', 'μια', '[MASK]', '.', '[SEP]']

outputs = lm_model_cypriot(torch.tensor([input_ids]))[0]  # logits: (batch, sequence, vocab)
# Position 4 is the [MASK] token; take the highest-scoring vocabulary id there
print(tokenizer_cypriot.convert_ids_to_tokens(outputs[0, 4].max(0)[1].item())) # χαρα
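
The examples above hard-code the [MASK] position and keep only the single best token. A small helper can instead locate the mask from the ids and return several candidates with their probabilities. The sketch below is deliberately dependency-free so it can be read in isolation. `find_mask_positions` and `top_k_predictions` are hypothetical helper names, and the ids and scores in the demo are made-up placeholders, not real model output.

```python
import heapq
import math


def find_mask_positions(input_ids, mask_token_id):
    """Return every index in `input_ids` that holds the mask token id."""
    return [i for i, token_id in enumerate(input_ids) if token_id == mask_token_id]


def top_k_predictions(logits, k=5):
    """Return the k highest-scoring (token_id, probability) pairs.

    `logits` is a flat list of vocabulary scores for a single position.
    """
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    best = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
    return [(token_id, exps[token_id] / total) for token_id in best]


# Toy demo with made-up ids and scores (NOT real model output).
toy_ids = [101, 7, 103, 9, 102]          # pretend 103 is the mask token id
toy_logits = [0.1, 2.0, -1.0, 3.5, 0.0]  # pretend vocabulary of five tokens

print(find_mask_positions(toy_ids, mask_token_id=103))
for token_id, prob in top_k_predictions(toy_logits, k=2):
    print(token_id, round(prob, 3))
```

With the real model, one would pass `tokenizer_cypriot.mask_token_id` to find the position, feed `outputs[0, position].tolist()` in as the logits, and map the returned ids back to tokens with `tokenizer_cypriot.convert_ids_to_tokens`.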

About Me

Petros Andreou

GitHub: @pedroandreou