Vietnamese Machine Reading Comprehension

Overview

Language model: xlm-roberta-base
Language: Vietnamese
Downstream task: Extractive QA
Dataset: UIT-ViQuAD2.0
Dataset Format: SQuAD 2.0
Infrastructure: CUDA, Tesla P100-PCIE-16GB GPU (Google Colab)

Requirements

The trainer requires the following Python packages:

transformers
datasets
evaluate
numpy

Run the following commands to install the required libraries:

pip install datasets evaluate numpy
pip install git+https://github.com/huggingface/transformers

Hyperparameters

batch_size = 16
n_epochs = 10
base_LM_model = "xlm-roberta-base"
max_seq_len = 256
learning_rate = 2e-5
weight_decay = 0.01
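
The training script itself is not shown in this README. As a rough sketch, the values above could be mapped onto the keyword names that `transformers.TrainingArguments` accepts. The `output_dir` value is an assumption inferred from the checkpoint path used in the Usage section, and the evaluation batch size is assumed to equal the training batch size.

```python
# Hyperparameters from this README, keyed by transformers.TrainingArguments names.
training_kwargs = {
    "output_dir": "results",            # assumption: inferred from "results/checkpoint-16000"
    "per_device_train_batch_size": 16,  # batch_size
    "per_device_eval_batch_size": 16,   # assumption: same as the training batch size
    "num_train_epochs": 10,             # n_epochs
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}

# max_seq_len is not a TrainingArguments field; it applies at tokenization time
# (the tokenizer's max_length argument).
max_seq_len = 256
base_LM_model = "xlm-roberta-base"

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

With transformers installed, the dictionary could be unpacked as `TrainingArguments(**training_kwargs)` and passed to a `Trainer`.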

Performance

Evaluated on the UIT-ViQuAD2.0 dev set with the official eval script.

 'exact': 29.947276,
 'f1': 43.627568,
 'total': 2845,
 'HasAns_exact': 43.827160,
 'HasAns_f1': 63.847958,
 'HasAns_total': 1944,
 'NoAns_exact': 0.0,
 'NoAns_f1': 0.0,
 'NoAns_total': 901
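
The overall scores are count-weighted averages of the HasAns and NoAns subsets. Because both NoAns scores are 0.0 (under SQuAD 2.0-style scoring, this means no unanswerable question received an empty prediction), the overall numbers are the HasAns numbers scaled by 1944/2845. The short check below reproduces them from the reported figures; the helper name `overall` is illustrative only.

```python
# Reported per-subset results from the table above.
has_ans = {"exact": 43.827160, "f1": 63.847958, "total": 1944}
no_ans = {"exact": 0.0, "f1": 0.0, "total": 901}


def overall(metric):
    """Count-weighted average of a metric over the HasAns and NoAns subsets."""
    total = has_ans["total"] + no_ans["total"]
    weighted = (has_ans[metric] * has_ans["total"]
                + no_ans[metric] * no_ans["total"])
    return weighted / total


overall_exact = overall("exact")
overall_f1 = overall("f1")
print(f"exact: {overall_exact:.6f}, f1: {overall_f1:.6f}")
```

The results agree with the reported 'exact' and 'f1' values up to rounding of the inputs.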

Usage

from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    pipeline,
)

# Path to the fine-tuned checkpoint directory
model_checkpoint = "results/checkpoint-16000"
question_answerer = pipeline("question-answering", model=model_checkpoint)

# a) Get predictions
QA_input = {
    'question': 'Hiến pháp Mali quy định thế nào đối với tôn giáo?',
    'context': 'Ước tính có khoảng 90% dân số Mali theo đạo Hồi (phần lớn là hệ phái Sunni), khoảng 5% là theo Kitô giáo (khoảng hai phần ba theo Giáo hội Công giáo Rôma và một phần ba là theo Tin Lành) và 5% còn lại theo các tín ngưỡng vật linh truyền thống bản địa. Một số ít người Mali theo thuyết vô thần và thuyết bất khả tri, phần lớn họ thực hiện những nghi lễ tôn giáo cơ bản hằng ngày. Các phong tục Hồi giáo ở Mali có mức độ vừa phải, khoan dung, và đã thay đổi theo các điều kiện của địa phương; các mối quan hệ giữa người Hồi giáo và các cộng đồng tôn giáo nhỏ khác nói chung là thân thiện. Hiến pháp của Mali đã quy định một thể chế nhà nước thế tục và ủng hộ quyền tự do tôn giáo, và chính phủ Mali phải đảm bảo quyền này.'
}
# res is a dict with the keys 'score', 'start', 'end' and 'answer'
res = question_answerer(QA_input)
print(res)

# b) Load model & tokenizer
model     = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
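
When the model and tokenizer are loaded directly as in b), the model returns one start logit and one end logit per token, and the answer span has to be chosen from them. The sketch below illustrates one common, simplified way to do this in plain Python. It is not the post-processing used for the reported results. The function name `select_span`, the `max_answer_len` and `null_threshold` parameters, and the toy logits are all hypothetical, and index 0 is assumed to be the first special token, whose score stands in for "no answer" as in SQuAD 2.0-style setups.

```python
def select_span(start_logits, end_logits, max_answer_len=30, null_threshold=0.0):
    """Pick the best (start, end) token span, or None for 'no answer'.

    Simplification: every token after index 0 is treated as a candidate;
    a real pipeline would restrict candidates to context tokens only.
    """
    # Score of the "no answer" prediction (first special token).
    null_score = start_logits[0] + end_logits[0]

    best_score, best_start, best_end = float("-inf"), None, None
    n = len(start_logits)
    for s in range(1, n):
        # The end must not precede the start, and the span length is capped.
        for e in range(s, min(s + max_answer_len, n)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best_start, best_end = score, s, e

    # Prefer "no answer" when it beats the best span by more than the threshold.
    if best_start is None or null_score - best_score > null_threshold:
        return None
    return best_start, best_end


# Toy logits (made up for illustration): tokens 2..3 form the best span.
answerable = select_span([0.1, 0.2, 3.0, 0.5, 0.1], [0.2, 0.1, 0.4, 2.5, 0.3])
# Toy logits where the first token dominates, so no answer is returned.
unanswerable = select_span([5.0, 0.1, 0.2], [5.0, 0.3, 0.1])
print(answerable, unanswerable)
```

The returned token indices would then be mapped back to character offsets in the context, for example via the tokenizer's offset mapping, to recover the answer text.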

Author

Duc Nguyen

Citation

Kiet Van Nguyen, Son Quoc Tran, Luan Thanh Nguyen, Tin Van Huynh, Son T. Luu, Ngan Luu-Thuy Nguyen. "VLSP 2021 Shared Task: Vietnamese Machine Reading Comprehension." The 8th International Workshop on Vietnamese Language and Speech Processing (VLSP 2021).