
Pre-trained DeBERTaV2 Language Model for Vietnamese Nôm

DeBERTaV2ForMaskedLM is the DeBERTaV2 model with a masked language modeling (MLM) head. Built upon the success of DeBERTa, DeBERTaV2 incorporates further enhancements that improve the model's performance in understanding and generating natural language.
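
As a quick illustration of masked-token prediction, the fill-mask pipeline in transformers can wrap a checkpoint like this one; a minimal sketch follows, assuming the checkpoint name used in the usage section below, with the example sentence serving only as a placeholder.

from transformers import pipeline

# A minimal sketch: the fill-mask pipeline wraps the MLM head described above.
# The checkpoint name is taken from the usage section below.
fill_mask = pipeline('fill-mask', model='minhtoan/DeBERTa-MLM-Vietnamese-Nom')

# The pipeline appends the tokenizer's own mask token and returns ranked candidates.
print(fill_mask('想払𨀐' + fill_mask.tokenizer.mask_token))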

This repository provides a pre-trained masked language model designed exclusively for Chữ Nôm, the traditional Vietnamese writing system.

The model was trained on a collection of literary works and poetry, including Bai ca ran co bac, Buom hoa tan truyen, Chinh phu ngam, Gia huan ca, poems by Ho Xuan Huong, Luc Van Tien, and the Tale of Kieu (1870, 1871, and 1902 editions), among others.

Nôm language models

Chữ Nôm language models refer to language models specifically designed and trained to understand and generate text in Chữ Nôm, the traditional writing system used for Vietnamese prior to the 20th century. These language models are trained using large datasets of Chữ Nôm texts to learn the patterns, grammar, and vocabulary specific to this writing system.

Developing a Nôm language model

Developing a high-quality Chữ Nôm language model requires a substantial amount of specialized data and expertise. Here are the general steps involved in creating a Chữ Nôm language model:

  1. Data Collection: Gather a sizable corpus of Chữ Nôm texts. This can include historical documents, literature, poetry, and other written materials in Chữ Nôm. It's essential to ensure the dataset covers a wide range of topics and genres.
  2. Data Preprocessing: Clean and preprocess the Chữ Nôm dataset. This step involves tokenization, normalization, and segmentation of the text into individual words or characters. Additionally, special attention needs to be given to handling ambiguities, variant spellings, and character forms in Chữ Nôm.
  3. Model Architecture: Select an appropriate neural network architecture for your Chữ Nôm language model. Popular choices include transformer-based architectures like BERT, GPT, or their variants, which have shown strong performance in various NLP tasks.
  4. Model Training: Train the Chữ Nôm language model on your preprocessed dataset. This typically involves pretraining the model on a masked language modeling objective, where the model predicts masked or missing tokens in a sentence (see the training sketch after this list). Additionally, you can employ other pretraining tasks like next sentence prediction or document-level modeling to enhance the model's understanding of context.
  5. Fine-tuning: Fine-tune the pretrained model on specific downstream tasks or domains relevant to Chữ Nôm. This step involves training the model on task-specific datasets or applying transfer learning techniques to adapt the model to more specific tasks.
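
The following is a minimal sketch of steps 2 and 4, masking tokens on the fly and pretraining with the Hugging Face Trainer. The corpus lines, hyperparameters, and output directory are placeholders, not the settings used for this model; for training from scratch you would build the tokenizer and model configuration yourself rather than start from an existing checkpoint.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder corpus: replace with real Chữ Nôm lines gathered in step 1.
corpus = Dataset.from_dict({'text': ['想払𨀐 ...', '...']})

# Reusing the published checkpoint only for illustration (continued pretraining).
tokenizer = AutoTokenizer.from_pretrained('minhtoan/DeBERTa-MLM-Vietnamese-Nom')
model = AutoModelForMaskedLM.from_pretrained('minhtoan/DeBERTa-MLM-Vietnamese-Nom')

# Step 2: tokenize and segment each line into model inputs.
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=['text'])

# Step 4: the collator randomly masks 15% of tokens, i.e. the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir='nom-mlm', per_device_train_batch_size=8, num_train_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()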

How to use the model

import torch
from transformers import RobertaTokenizerFast, RobertaForMaskedLM

# Load the tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained('minhtoan/DeBERTa-MLM-Vietnamese-Nom')

# Load the model
model = RobertaForMaskedLM.from_pretrained('minhtoan/DeBERTa-MLM-Vietnamese-Nom')

# Example input sentence ending with the tokenizer's mask token
input_sentence = '想払𨀐' + tokenizer.mask_token

# Tokenize the sentence
input_tokens = tokenizer(input_sentence, return_tensors='pt')['input_ids']

# Locate the masked position
mask_token_index = (input_tokens[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

# Generate predictions
with torch.no_grad():
    outputs = model(input_tokens)
    predictions = outputs.logits.argmax(dim=-1)

# Decode and print the predicted word
predicted_word = tokenizer.decode(predictions[0, mask_token_index])
print("Predicted word:", predicted_word)

Author

Phan Minh Toan