Indojave: IndoBERTweet-base

About

This is a pre-trained masked language model for code-mixed Indonesian-Javanese-English tweets. It is built on the IndoBERTweet model and trained with Hugging Face's Transformers library.

Pre-training Data

The Twitter data was collected from January 2022 to January 2023 using 8,698 random keyword phrases. To make sure the retrieved tweets are code-mixed, we use keyword phrases that contain code-mixed Indonesian, Javanese, or English words. The following are a few examples of the keyword phrases:

travelling terus
proud koncoku
great kalian semua
chattingane ilang
baru aja launching

We acquire 40,788,384 raw tweets and apply first-stage pre-processing tasks such as:

remove duplicate tweets,
remove tweets with fewer than 5 tokens,
remove repeated spaces,
convert emoticons,
convert all tweets to lower case.
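The first-stage tasks above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the emoticon map is hypothetical (the card does not say what emoticons are converted to), tokens are assumed to be whitespace-separated, and duplicates are removed after normalization rather than before.

```python
import re

# Hypothetical emoticon map: the card does not specify the conversion targets.
EMOTICON_MAP = {":)": "smile", ":(": "sad"}


def clean_tweet(text):
    """Convert emoticons, collapse repeated whitespace, and lowercase one tweet."""
    for emoticon, word in EMOTICON_MAP.items():
        text = text.replace(emoticon, f" {word} ")
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


def first_stage(tweets, min_tokens=5):
    """Clean tweets, drop those shorter than min_tokens, and drop duplicates."""
    seen = set()
    kept = []
    for raw in tweets:
        tweet = clean_tweet(raw)
        # Assumption: a token is a whitespace-separated word.
        if len(tweet.split()) < min_tokens:
            continue
        if tweet in seen:
            continue
        seen.add(tweet)
        kept.append(tweet)
    return kept
```

Deduplicating after normalization means tweets that differ only in casing or spacing count as duplicates; whether the original pipeline did the same is not stated.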

After first-stage pre-processing, we obtain 17,385,773 tweets. The second stage applies the following tasks:

split the tweets into sentences,
remove sentences with fewer than 4 tokens,
convert ‘@username’ to ‘@USER’,
convert URLs to ‘HTTPURL’.
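A rough sketch of the second stage, following the order of the list above. The regular expressions are assumptions: the card does not say which sentence splitter or which mention and URL patterns were used, so here a sentence is assumed to end at `.`, `!`, or `?` followed by whitespace.

```python
import re

# Assumed patterns; the card does not give the exact rules.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
USER_RE = re.compile(r"@\w+")


def second_stage(tweet, min_tokens=4):
    """Split a tweet into sentences, drop short ones, mask mentions and URLs."""
    sentences = []
    for sentence in SENTENCE_END_RE.split(tweet):
        sentence = sentence.strip()
        # Assumption: a token is a whitespace-separated word.
        if len(sentence.split()) < min_tokens:
            continue
        # Mask URLs before mentions so an "@" inside a URL is not touched.
        sentence = URL_RE.sub("HTTPURL", sentence)
        sentence = USER_RE.sub("@USER", sentence)
        sentences.append(sentence)
    return sentences
```

Because the placeholders are uppercase while first-stage output is lowercase, `@USER` and `HTTPURL` stay distinguishable from ordinary words in the training text.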

Finally, we have 28,121,693 sentences for the training process. The pre-training data will not be released publicly due to Twitter policy.

Model

Model name	Base model	Size of training data	Size of validation data
`indojave-codemixed-indobertweet-base`	IndoBERTweet	2.24 GB of text	249 MB of text

Evaluation Results

We train the model for 3 epochs, a total of 296K steps, over 4 days. The training results are as follows:

train loss	eval loss	eval perplexity
2.7145	2.4854	12.0054
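The reported perplexity is consistent with the usual definition for masked language models, the exponential of the mean evaluation cross-entropy loss (in nats). The short check below assumes that definition; the small gap to 12.0054 comes from the loss being rounded to four decimals.

```python
import math

eval_loss = 2.4854  # reported eval loss, rounded to 4 decimals

# Perplexity as exp(mean cross-entropy loss).
perplexity = math.exp(eval_loss)
print(f"{perplexity:.2f}")  # prints 12.01
```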

How to use

Load model and tokenizer

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("fathan/indojave-codemixed-indobertweet-base")
model = AutoModel.from_pretrained("fathan/indojave-codemixed-indobertweet-base")

Masked language model

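from transformers import pipeline

pretrained_model = "fathan/indojave-codemixed-indobertweet-base"

# Build a fill-mask pipeline that uses the model and its own tokenizer.
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_model,
    tokenizer=pretrained_model,
)

# Illustrative input (not from the original card); the mask token is read from the tokenizer.
text = f"baru aja {fill_mask.tokenizer.mask_token} produk baru"
print(fill_mask(text))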

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 256
eval_batch_size: 256
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3.0
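To make the linear scheduler concrete, the sketch below computes the learning rate at a given step. It assumes zero warmup steps (none are listed above) and uses 296,000 as an approximation of the reported 296K total steps; both are assumptions, and `linear_lr` is an illustrative helper, not part of the Transformers API.

```python
def linear_lr(step, total_steps=296_000, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, midpoint, and end of training.
for step in (0, 148_000, 296_000):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at 5e-05, is halved midway through training, and reaches zero at the final step.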

Framework versions

Transformers 4.26.0
PyTorch 1.12.0+cu102
Datasets 2.9.0
Tokenizers 0.12.1