Model Card for HyeBERT
Pre-trained language model trained on Armenian using a masked language modeling objective. The architecture is based on BERT, but the model was trained exclusively on Armenian text (hence the Hye in HyeBERT), namely the Armenian subset of OSCAR, a cleaned and de-duplicated subset of the Common Crawl dataset.
Disclaimer: this model is not specifically trained for either the Western or Eastern dialect, though the data likely contain more examples of Eastern Armenian.
Model Description
HyeBERT shares the same architecture as BERT: it is a stacked transformer model trained on a large corpus of Armenian without any human annotations. However, it was trained using only the masked language modeling task (replacing 15% of tokens with [MASK] and predicting them from the other tokens in the text), without the next-sentence prediction task, which makes it more akin to RoBERTa. Unlike RoBERTa, however, it tokenizes using WordPiece rather than BPE.
Intended Uses
Direct Use
As a masked language model (MLM), this model can be used to predict a masked word in a sentence. It can also be used for text generation, though generation is better done with a model like GPT.
Downstream Use
The ideal use of this model is fine-tuning on a specific classification task for Armenian.
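A minimal sketch of how such fine-tuning might be set up with the Hugging Face `transformers` library, assuming it is installed. `MODEL_ID` and the two example labels are hypothetical placeholders; this card does not specify a checkpoint identifier or a label scheme.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder: replace with the real Hub ID or local path of the checkpoint.
MODEL_ID = "path/or/hub-id/of/HyeBERT"


def build_label_maps(labels: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build the id2label / label2id mappings stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_classification(labels: List[str], model_id: str = MODEL_ID):
    """Load the tokenizer and the encoder with a newly initialized classification head."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_for_classification(["negative", "positive"])` would return a tokenizer and a model whose classification head still has to be trained on labeled Armenian data.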
Bias, Risks, and Limitations
As mentioned earlier, this model is not trained exclusively on either Western or Eastern Armenian, which may lead to problems in its internal representation of the language's syntax and lexicon. In addition, many of the training texts include content in other languages (mostly English and Russian), which may affect the performance of the model.
How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
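Until the author's own snippet is provided, the following is a minimal sketch of masked-word prediction with the `transformers` fill-mask pipeline. It assumes `transformers` is installed and that the checkpoint can be loaded from the identifier in `MODEL_ID`, which is a hypothetical placeholder and not a published name.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder: replace with the real Hub ID or local path of the checkpoint.
MODEL_ID = "path/or/hub-id/of/HyeBERT"


def summarize(predictions: List[Dict]) -> List[Tuple[str, float]]:
    """Reduce fill-mask output to (predicted token, rounded score) pairs."""
    return [(p["token_str"], round(p["score"], 4)) for p in predictions]


def demo(model_id: str = MODEL_ID) -> List[Tuple[str, float]]:
    """Predict the masked word in one Armenian sentence (loads the checkpoint)."""
    # Imported lazily: only needed when demo() is actually called.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # Roughly: "Yerevan is Armenia's [MASK]."
    text = f"Երևանը Հայաստանի {fill_mask.tokenizer.mask_token} է։"
    return summarize(fill_mask(text))
```

Calling `demo()` returns the pipeline's top candidates for the masked position together with their scores.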
Training Details
Training Data
This model was trained on the Armenian subset of the OSCAR corpus, which is a cleaned version of the Common Crawl. The training data consist of roughly XXX documents, with roughly YYY tokens in total. 2% of the total dataset was held out and used for validation.
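The 2% hold-out can be illustrated with a short sketch. The card does not say how the validation documents were selected, so the seeded random shuffle below is an assumption, and `train_val_split` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple


def train_val_split(
    documents: Sequence[str], val_fraction: float = 0.02, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle document indices and hold out `val_fraction` of them for validation."""
    indices = list(range(len(documents)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(documents) * val_fraction)
    val = [documents[i] for i in indices[:n_val]]
    train = [documents[i] for i in indices[n_val:]]
    return train, val


docs = [f"document {i}" for i in range(1000)]  # toy stand-in for the corpus
train_docs, val_docs = train_val_split(docs)
print(len(train_docs), len(val_docs))  # 980 20
```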
Training Procedure
The model was trained by masking 15% of tokens and predicting the identity of those masked tokens from the unmasked tokens in the same training example. The model was trained for 3 epochs, and the masked positions in a given text were reassigned for each epoch, i.e., the masks moved around from epoch to epoch.
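The per-epoch re-masking described above can be sketched in plain Python. This is an illustration only, not the training code: it masks a fixed 15% share of positions and always substitutes [MASK], and the use of the epoch number as a random seed is an assumption. `mask_tokens` is a hypothetical helper name.

```python
import random
from typing import List, Tuple

MASK_TOKEN = "[MASK]"


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[int]]:
    """Replace a random `mask_prob` share of tokens with [MASK].

    Returns the masked sequence and the sorted positions that were masked.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK_TOKEN
    return masked, positions


tokens = [f"tok{i}" for i in range(20)]  # toy stand-in for a tokenized text
for epoch in range(3):
    # Seeding with the epoch number makes each epoch draw its own set of positions.
    masked, positions = mask_tokens(tokens, seed=epoch)
    print(epoch, positions)
```

The model would then be trained to recover the original token at each masked position.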
Preprocessing
No major preprocessing was applied. Texts of fewer than 5 characters were removed, and texts were truncated to 512 tokens.
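A minimal sketch of these two filtering steps. Whitespace splitting stands in for the WordPiece tokenizer here, which is an assumption made to keep the example self-contained; the card also does not say whether special tokens count toward the 512-token limit.

```python
from typing import Iterable, List

MIN_CHARS = 5     # texts shorter than this are dropped
MAX_TOKENS = 512  # longer texts are truncated to this many tokens


def preprocess(texts: Iterable[str]) -> List[List[str]]:
    """Drop very short texts and truncate the rest to MAX_TOKENS tokens."""
    cleaned = []
    for text in texts:
        if len(text) < MIN_CHARS:
            continue
        # Assumption: whitespace tokens stand in for WordPiece tokens.
        tokens = text.split()
        cleaned.append(tokens[:MAX_TOKENS])
    return cleaned


sample = ["այո", "Բարև ձեզ, ինչպես եք", "բառ " * 600]
result = preprocess(sample)
print([len(tokens) for tokens in result])  # [4, 512]
```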
Training Hyperparameters
- Optimizer: AdamW
- Learning rate: 1e-4
- Num. attention heads: 12
- Num. hidden layers: 6
- Vocab. size: 30,000
- Embedding size: 768
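The architecture-related values in this list can be collected in a small configuration object. This is a sketch: the class name is hypothetical, and reading "embedding size" as the transformer hidden size is an assumption. The keys returned by `bert_config_kwargs` follow the parameter names of `transformers.BertConfig`.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HyeBertArchitecture:
    num_attention_heads: int = 12
    num_hidden_layers: int = 6
    vocab_size: int = 30_000
    hidden_size: int = 768  # assumption: the card's "embedding size"

    @property
    def head_dim(self) -> int:
        """Width of each attention head; hidden_size must divide evenly."""
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    def bert_config_kwargs(self) -> Dict[str, int]:
        """Keyword arguments named after the parameters of transformers.BertConfig."""
        return {
            "vocab_size": self.vocab_size,
            "hidden_size": self.hidden_size,
            "num_hidden_layers": self.num_hidden_layers,
            "num_attention_heads": self.num_attention_heads,
        }


arch = HyeBertArchitecture()
print(arch.head_dim)  # 64
```

With 12 heads over a 768-dimensional hidden state, each head works on 64 dimensions. The 6 hidden layers are half of the 12 used in BERT-base.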
Evaluation
At the end of each epoch, the loss was computed on a held-out validation set of roughly 2% of the total data.
- Epoch 0: val_loss = 0.47787963975066194
- Epoch 1: val_loss = 0.47497553823474115
- Epoch 2: val_loss = 0.4765327044259816
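For easier interpretation, the validation losses can be converted to perplexities. The sketch below assumes each value is a mean cross-entropy, in nats, over the masked tokens; the card does not state this explicitly.

```python
import math

# Validation losses reported above, indexed by epoch.
val_losses = [0.47787963975066194, 0.47497553823474115, 0.4765327044259816]

# Assumption: each loss is a mean cross-entropy in nats, so exp(loss) is a perplexity.
perplexities = [math.exp(loss) for loss in val_losses]
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)

for epoch, (loss, ppl) in enumerate(zip(val_losses, perplexities)):
    print(f"epoch {epoch}: val_loss={loss:.4f}  perplexity={ppl:.2f}")
print(f"lowest validation loss at epoch {best_epoch}")
```

The loss is lowest after the second epoch (epoch 1) and rises slightly in the third, although the three values differ by less than 0.003.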
Model Card Authors
Adam King
Model Card Contact
adam.king.phd@gmail.com