ALBert

ALR-BERT is a cased ALBERT model for Romanian, trained on a 15 GB corpus. It is a multi-layer bidirectional Transformer encoder that uses ALBERT's factorized embedding parameterization and cross-layer parameter sharing. ALR-BERT-base inherits the ALBERT-base configuration: 12 parameter-sharing layers, a 128-dimensional embedding, 768 hidden units, 12 attention heads, and GELU non-linearities. ALBERT is pre-trained with two objectives, masked language modeling (MLM) and sentence order prediction (SOP), and ALR-BERT preserves both.

The model was trained with a batch size of 40 per GPU (for sequence length 128) and then 20 per GPU (for sequence length 512). The LAMB optimizer (Layer-wise Adaptive Moments optimizer for Batch training) was used, with a warm-up over the first 1% of steps up to a learning rate of 1e-4, followed by a decay. Training ran on eight NVIDIA Tesla V100 SXM3 GPUs with 32 GB of memory each, and pre-training took around 2 weeks per model.
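To give a feel for what the factorized embedding parameterization saves, the sketch below compares the number of token-embedding weights with and without the factorization, using the 128 and 768 dimensions given above. The vocabulary size of 50,000 is a placeholder assumption and not a figure from this model card; the function names are illustrative, and bias terms are ignored.

```python
def direct_embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Embedding weights when tokens are embedded directly in hidden space (BERT-style)."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embedding_size: int, hidden_size: int) -> int:
    """Embedding weights with an ALBERT-style factorization:
    a (vocab x embedding) lookup table plus an (embedding x hidden) projection."""
    return vocab_size * embedding_size + embedding_size * hidden_size


VOCAB_SIZE = 50_000    # ASSUMPTION: placeholder, not the real ALR-BERT vocabulary size
EMBEDDING_SIZE = 128   # from the description above
HIDDEN_SIZE = 768      # from the description above

direct = direct_embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
factorized = factorized_embedding_params(VOCAB_SIZE, EMBEDDING_SIZE, HIDDEN_SIZE)

print(f"direct:     {direct:,} weights")
print(f"factorized: {factorized:,} weights")
print(f"reduction:  {direct / factorized:.1f}x")
```

Under this assumed vocabulary size the factorization cuts the embedding weights by roughly a factor of six; the exact ratio depends on the real vocabulary.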

The training methodology closely follows previous work on Romanian BERT (https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1).
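The warm-up and decay schedule described above can be sketched as follows. This is a minimal illustration, not the training code: the peak learning rate is taken as 1e-4, the decay is assumed to be linear (the description only says "a decay"), and the total step count of 100,000 is a placeholder, since the real number of steps is not stated.

```python
PEAK_LR = 1e-4          # peak learning rate reached at the end of warm-up
WARMUP_FRACTION = 0.01  # warm-up over the first 1% of steps


def learning_rate(step: int, total_steps: int,
                  peak_lr: float = PEAK_LR,
                  warmup_fraction: float = WARMUP_FRACTION) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero (decay shape is an ASSUMPTION)."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr at the end of warm-up to 0 at total_steps.
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


TOTAL_STEPS = 100_000  # ASSUMPTION: placeholder value for illustration

for step in (0, 500, 1_000, 50_500, 100_000):
    print(step, learning_rate(step, TOTAL_STEPS))
```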

How to use


from transformers import AutoTokenizer, AutoModel
import torch

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dragosnicolae555/ALR_BERT")
model = AutoModel.from_pretrained("dragosnicolae555/ALR_BERT")

# Here add your magic

Remember to always sanitize your text! Replace the s-cedilla and t-cedilla letters with their comma-below counterparts:

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

This matters because the model was NOT trained on the cedilla forms of s and t. Without this replacement, performance will drop due to <UNK> tokens and a higher number of tokens per word.
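Putting the pieces above together, one possible way to go from raw text to contextual embeddings is sketched below. The helper names `sanitize` and `embed` are illustrative and not part of the model or of the transformers library; the model identifier is the one used above. The heavy imports are kept inside `embed` so that `sanitize` can be used on its own.

```python
# Map cedilla letters to their comma-below counterparts.
# Escapes are used so the mapping does not depend on editor or font rendering.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021b",  # t-cedilla -> t-comma (lowercase)
    "\u015f": "\u0219",  # s-cedilla -> s-comma (lowercase)
    "\u0162": "\u021a",  # T-cedilla -> T-comma (uppercase)
    "\u015e": "\u0218",  # S-cedilla -> S-comma (uppercase)
})


def sanitize(text: str) -> str:
    """Replace s/t cedilla letters with comma-below letters; leave everything else untouched."""
    return text.translate(_CEDILLA_TO_COMMA)


def embed(text: str, model_name: str = "dragosnicolae555/ALR_BERT"):
    """Return the last-layer hidden states for one sanitized input string."""
    # Local imports: only needed when the model is actually run.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(sanitize(text), return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    return outputs.last_hidden_state
```

Calling `embed("Aceasta este o propoziţie de test.")` downloads the model on first use and should return a tensor with one row per token; given the 768 hidden units stated above, the last dimension is expected to be 768. For repeated calls, loading the tokenizer and model once outside the function would be the more sensible design.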

Evaluation

Here, we evaluate ALR-BERT on the Simple Universal Dependencies task. We train one model per task and evaluate labeling performance on UPOS (Universal Part-of-Speech) and XPOS (eXtended Part-of-Speech) tags. We compare ALR-BERT with Romanian BERT and multilingual BERT, using the cased versions. To counteract the effect of the random seed, we repeat each experiment five times and report the mean score.

| Model | UPOS | XPOS | MLAS | AllTags |
|---|---|---|---|---|
| M-BERT (cased) | 93.87 | 89.89 | 90.01 | 87.04 |
| Romanian BERT (cased) | 95.56 | 95.35 | 92.78 | 93.22 |
| ALR-BERT (cased) | 87.38 | 84.05 | 79.82 | 78.82 |
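The "repeat five times and report the mean" protocol can be illustrated with a small sketch. It assumes gold tokenization, so that a tagging score reduces to token-level accuracy; the actual evaluation script is not described here, and the tags and predictions below are toy data invented for the example, not results from these experiments.

```python
from statistics import mean


def tagging_accuracy(gold_tags, predicted_tags) -> float:
    """Percentage of tokens whose predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


# TOY DATA (hypothetical): one gold sequence and predictions from five seeds.
gold = ["DET", "NOUN", "VERB", "ADJ"]
predictions_per_seed = [
    ["DET", "NOUN", "VERB", "ADJ"],
    ["DET", "NOUN", "VERB", "NOUN"],
    ["DET", "NOUN", "VERB", "ADJ"],
    ["DET", "VERB", "VERB", "ADJ"],
    ["DET", "NOUN", "VERB", "ADJ"],
]

scores = [tagging_accuracy(gold, predicted) for predicted in predictions_per_seed]
mean_score = mean(scores)
print(f"per-seed scores: {scores}")
print(f"mean over {len(scores)} seeds: {mean_score:.2f}")
```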

Corpus

The model is trained on the following corpora (stats in the table below are after cleaning):

| Corpus | Lines (M) | Words (M) | Chars (B) | Size (GB) |
|---|---|---|---|---|
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
| Total | 90.15 | 2421.33 | 15.867 | 15.2 |
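As a quick sanity check on the table, the short sketch below recomputes the totals from the per-corpus rows and derives each corpus's share of the total size. The numbers are copied from the table; the variable names are illustrative.

```python
import math

# (lines in millions, words in millions, characters in billions, size in GB)
corpora = {
    "OPUS": (55.05, 635.04, 4.045, 3.8),
    "OSCAR": (33.56, 1725.82, 11.411, 11.0),
    "Wikipedia": (1.54, 60.47, 0.411, 0.4),
}
reported_total = (90.15, 2421.33, 15.867, 15.2)

# Sum each column across corpora and compare with the reported total row.
computed_total = tuple(sum(column) for column in zip(*corpora.values()))
for name, computed, reported in zip(("lines", "words", "chars", "size"),
                                    computed_total, reported_total):
    status = "ok" if math.isclose(computed, reported, abs_tol=1e-6) else "MISMATCH"
    print(f"{name}: computed={computed:.3f} reported={reported} {status}")

# Fraction of the total size contributed by each corpus.
size_share = {name: row[3] / reported_total[3] for name, row in corpora.items()}
for name, share in size_share.items():
    print(f"{name}: {share:.1%} of total size")
```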