xlnet-large-bahasa-cased

Pretrained XLNet large language model for Malay.

Pretraining Corpus

The xlnet-large-bahasa-cased model was pretrained on ~1.4 billion words. The training data consists of:

  1. cleaned local texts.
  2. translated The Pile.

Pretraining Details

Load Pretrained Model

To use this model, install PyTorch or TensorFlow along with the Hugging Face transformers library, then load it directly:

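from transformers import XLNetModel, XLNetTokenizer

# Load the pretrained weights from the Hugging Face Hub.
model = XLNetModel.from_pretrained('malay-huggingface/xlnet-large-bahasa-cased')
# The model is cased, so the tokenizer must not lowercase its input.
tokenizer = XLNetTokenizer.from_pretrained(
    'malay-huggingface/xlnet-large-bahasa-cased',
    do_lower_case=False,
)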
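
Once the model and tokenizer are loaded, one common use is extracting sentence embeddings. The following is a minimal sketch, not an officially documented recipe for this model: the helper name `embed_sentences` and the mean-pooling strategy are assumptions made for illustration, and it assumes the PyTorch backend.

```python
import torch


def embed_sentences(sentences, model, tokenizer):
    """Return one mean-pooled vector per sentence.

    Hypothetical helper for illustration; mean pooling is an assumed
    choice, not something prescribed by the model authors.
    """
    # Pad to the longest sentence in the batch and return PyTorch tensors.
    batch = tokenizer(sentences, padding=True, return_tensors='pt')
    model.eval()  # disable dropout for inference
    with torch.no_grad():
        outputs = model(**batch)
    hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
    # Zero out padding positions before averaging over the sequence.
    mask = batch['attention_mask'].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

Calling `embed_sentences(['Saya suka makan nasi lemak.'], model, tokenizer)` with the `model` and `tokenizer` loaded above should return a tensor with one row per input sentence.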