This is a RoBERTa model pretrained on Japanese text.

Pretraining data: 3.45 GB of Japanese Wikipedia text.

Trained for 1.65M steps.

The model uses a SentencePiece tokenizer.

If you want to fine-tune the model, please use a maximum sequence length of 510 and load the tokenizer and weights as follows:

from transformers import AlbertTokenizer, RobertaModel

# Load the SentencePiece tokenizer and the pretrained RoBERTa weights shipped with this model
tokenizer = AlbertTokenizer.from_pretrained('souseki_sentencepiece.model')
model = RobertaModel.from_pretrained('pytorch_model.bin')

Caution: please set max_len = 512 - 2 = 510 (a power of two minus the two special tokens added by the tokenizer).
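As a minimal usage sketch (the example sentence, the truncation settings, and the use of torch.no_grad() are illustrative assumptions, not part of the original instructions), encoding within the 510-token limit looks like this:

import torch

# Encode an example sentence within the 510-token limit and run the encoder
text = "吾輩は猫である。"  # illustrative input sentence
inputs = tokenizer(text, max_length=510, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)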

Accuracy on JGLUE MARC-ja v1.0 (binary sentiment classification): 95.4%.
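For reference, a binary sentiment classifier on top of this encoder could be set up roughly as below. This is a hedged sketch: the classification head, the example review text, and the loading path are assumptions, not the exact configuration behind the reported score.

from transformers import RobertaForSequenceClassification

# Sketch: attach a 2-label classification head to the pretrained encoder for fine-tuning
clf = RobertaForSequenceClassification.from_pretrained('pytorch_model.bin', num_labels=2)
enc = tokenizer("この商品はとても良かったです。", max_length=510, truncation=True, return_tensors="pt")  # illustrative review text
logits = clf(**enc).logits  # shape (1, 2); argmax over the last dimension gives the predicted label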

Contributed by Mori Lab, Yokohama National University.

@article{liu2019roberta,
title={RoBERTa: A Robustly Optimized BERT Pretraining Approach},
author={Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin},
journal={arXiv preprint arXiv:1907.11692},
year={2019}
}