Tags: japanese, text-generation, gptj, pytorch, transformers, t5tokenizer, sentencepiece

This pre-trained model is a work in progress! Model weights will be made available for download in the future.

A 6.8-billion-parameter pre-trained model for the Japanese language, based on EleutherAI's Mesh Transformer JAX, with a model structure similar to their GPT-J-6B pre-trained model.

A Japanese pre-trained model built on EleutherAI's Mesh Transformer JAX codebase, with a GPT-J-6B-like structure and approximately 6.87 billion parameters.

Specifications

| Hyperparameter | Value |
| --- | --- |
| n_parameters | 6,876,450,080 |
| n_layers | 32 |
| d_model | 4,096 |
| d_ff | 16,384 |
| n_heads | 16 |
| d_head | 256 |
| n_ctx | 2,048 |
| n_vocab | 52,512 |
| position encoding | Rotary position encodings (RoPE) |
| RoPE dimensions | 64 |
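As a sanity check, the reported parameter count can be approximated from the hyperparameters in the table. The sketch below is a back-of-the-envelope estimate assuming a GPT-J-style block (four attention projections plus a two-matrix MLP per layer) and untied input/output embeddings; biases and layer norms make up the small remainder.

```python
# Back-of-the-envelope parameter count from the hyperparameters above.
# Assumption: GPT-J-style layers, untied embedding and LM head;
# biases and layer norms (a few million parameters) are ignored.
n_layers, d_model, d_ff, n_vocab = 32, 4096, 16384, 52512

attn = 4 * d_model * d_model        # q, k, v, and output projections
mlp = 2 * d_model * d_ff            # up- and down-projection
per_layer = attn + mlp
embeddings = 2 * n_vocab * d_model  # input embedding + LM head

total = n_layers * per_layer + embeddings
print(f"{total:,}")  # 6,872,629,248 -- within ~0.1% of the reported 6,876,450,080
```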

Instructions

We recommend using finetuneanon's forked Transformers codebase for inference, as the split checkpoint loads much faster than the monolithic checkpoint supported by the HuggingFace Transformers repository.

The tokenizer still uses 50256 as a substitute for <|endoftext|>. Token ID 50256 should therefore be excluded during inference.
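To illustrate, here is a minimal sketch of suppressing token 50256 at decoding time. The function name and masking approach are illustrative, not part of the released codebase: the banned token's logit is set to negative infinity so it can never be selected by argmax or sampling.

```python
import math

N_VOCAB = 52512  # vocabulary size from the table above
EOT_ID = 50256   # tokenizer's <|endoftext|> substitute

def suppress_tokens(logits, banned_ids=(EOT_ID,)):
    """Return a copy of `logits` with banned token IDs masked to -inf,
    so they can never be chosen by argmax or sampling."""
    masked = list(logits)
    for tid in banned_ids:
        masked[tid] = -math.inf
    return masked

# Toy check: even if 50256 has the highest raw score, it is never chosen.
logits = [0.0] * N_VOCAB
logits[EOT_ID] = 10.0
logits[123] = 5.0
best = max(range(N_VOCAB), key=lambda i: suppress_tokens(logits)[i])
print(best)  # 123
```

With the HuggingFace Transformers `generate()` API, the equivalent effect can be obtained by passing `bad_words_ids=[[50256]]`.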

Datasets

The lack of a high-quality Japanese corpus was one of the major challenges when we trained the model. We aimed to compile well-formatted corpora beyond Common Crawl.

The dataset is normalized and sanitized against leading and trailing spaces and excessive CR/LF repetitions.
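A minimal sketch of that kind of sanitization is shown below. The exact rules the authors used are not published; this version assumes "excessive CR/LF repetitions" means three or more consecutive line breaks, collapsed down to a single blank line.

```python
import re

def sanitize(text):
    # Normalize Windows/Mac line endings, then trim leading/trailing
    # whitespace on each line.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line.strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Collapse runs of 3+ newlines to 2 (assumption: this is "excessive").
    return re.sub(r"\n{3,}", "\n\n", text).strip()

print(sanitize("  foo  \r\n\n\n\n bar "))  # "foo\n\nbar"
```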

The whole dataset is about 400GB (as of October 2021) and 106B tokens (compared to 825GB/300B tokens for The Pile).
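Those figures imply a noticeably higher byte-per-token ratio than The Pile, which is expected for Japanese text under a multi-byte encoding. A quick comparison using the numbers above (assuming decimal gigabytes):

```python
# Rough bytes-per-token comparison, using the figures quoted above.
datasets = {
    "This dataset": (400e9, 106e9),  # (bytes, tokens)
    "The Pile":     (825e9, 300e9),
}
ratios = {name: size / tokens for name, (size, tokens) in datasets.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} bytes/token")
# This dataset: 3.77 bytes/token
# The Pile: 2.75 bytes/token
```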

* Common Crawl

* Books

* News

* Wikipedia

* Other Corpora