t5 text2text-generation seq2seq

t5-base-japanese-web (with Byte-fallback, 32K)

Description

megagonlabs/t5-base-japanese-web is a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts.
The training code is available on GitHub.

The vocabulary size of this model is 32K. An 8K-vocabulary version is also available.
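Since T5 is a text-to-text model, inference is a plain string-in, string-out call. A minimal usage sketch with the Hugging Face `transformers` library (assuming `transformers` and `sentencepiece` are installed; the input string and generation parameters are illustrative):

```python
# Sketch: load the model from the Hub and generate text.
# Assumes network access to download the checkpoint.
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL_NAME = "megagonlabs/t5-base-japanese-web"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# Both input and output are plain strings; this model is a pre-trained
# LM, so it is normally fine-tuned on a downstream task before use.
inputs = tokenizer("こんにちは、", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that, as a pre-trained model, it is intended as a starting point for fine-tuning on seq2seq tasks (summarization, QA, etc.) rather than for direct use.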

Corpora

We used the following corpora for pre-training: the Japanese portion of mC4 and Japanese Wiki-40B (see Citations).

Tokenizer

We used Japanese Wikipedia to train the SentencePiece tokenizer.
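The byte-fallback option mentioned in the title changes how out-of-vocabulary characters are handled: instead of collapsing them into a single lossy `<unk>` token, SentencePiece emits their UTF-8 bytes as `<0xNN>` tokens. A simplified pure-Python illustration of the idea (not the actual SentencePiece implementation; the toy vocabulary is hypothetical):

```python
# Toy illustration of byte-fallback tokenization: characters present in
# the vocabulary become normal tokens; anything else falls back to one
# <0xNN> token per UTF-8 byte, so encoding is always lossless.
def tokenize_with_byte_fallback(text, vocab):
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

# 語 (U+8A9E) is missing from this toy vocab, so it is emitted as its
# three UTF-8 bytes instead of <unk>.
print(tokenize_with_byte_fallback("日本語", {"日", "本"}))
# → ['日', '本', '<0xE8>', '<0xAA>', '<0x9E>']
```

Because the 256 byte tokens cover any input, a byte-fallback vocabulary never loses information on rare kanji, emoji, or other characters outside the 32K vocabulary.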

Parameters

Pre-training took about 126 hours on a TPU v3-8.

Related models

An 8K-vocabulary version of this model is also available.

License

Apache License 2.0

Citations

Contains information from mC4 which is made available under the ODC Attribution License.

@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}
@inproceedings{49029,
    title = {Wiki-40B: Multilingual Language Model Dataset},
    author = {Mandy Guo and Zihang Dai and Denny Vrandecic and Rami Al-Rfou},
    booktitle = {LREC 2020},
    year = {2020},
}