nagisa_bert

A BERT model for nagisa. The model is available in Transformers 🤗.

A tokenizer for nagisa_bert is available here.

Install

To use this model, install the nagisa_bert Python library with the pip command.

Python 3.7+ on Linux or macOS is required.

pip install nagisa_bert

Usage

This model can be used with the pipeline function in Transformers.

from transformers import pipeline
from nagisa_bert import NagisaBertTokenizer

text = "nagisaで[MASK]できるヒデルです"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
fill_mask = pipeline("fill-mask", model='taishi-i/nagisa_bert', tokenizer=tokenizer)
print(fill_mask(text))
[{'score': 0.1385931372642517,
  'sequence': 'nagisa で 使用 できる ヒデル です',
  'token': 8092,
  'token_str': '使 用'},
 {'score': 0.11947669088840485,
  'sequence': 'nagisa で 利用 できる ヒデル です',
  'token': 8252,
  'token_str': '利 用'},
 {'score': 0.04910655692219734,
  'sequence': 'nagisa で 作成 できる ヒデル です',
  'token': 9559,
  'token_str': '作 成'},
 {'score': 0.03792576864361763,
  'sequence': 'nagisa で 購入 できる ヒデル です',
  'token': 9430,
  'token_str': '購 入'},
 {'score': 0.026893319562077522,
  'sequence': 'nagisa で 入手 できる ヒデル です',
  'token': 11273,
  'token_str': '入 手'}]
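
The pipeline returns a list of dictionaries, one per candidate, each with a score and a token_str. As a minimal sketch of post-processing that list, the snippet below picks the best candidate and filters by a score threshold. The helper names best_token and tokens_above are hypothetical, the sample entries are copied from the output above with scores rounded, and the threshold of 0.1 is an arbitrary choice for illustration.

```python
# Sample entries copied from the fill-mask output above (scores rounded).
predictions = [
    {"score": 0.1386, "token_str": "使 用"},
    {"score": 0.1195, "token_str": "利 用"},
    {"score": 0.0491, "token_str": "作 成"},
]


def best_token(preds):
    """Return the highest-scoring token string with internal spaces removed."""
    top = max(preds, key=lambda p: p["score"])
    return top["token_str"].replace(" ", "")


def tokens_above(preds, threshold):
    """Return the token strings whose score is at least `threshold`."""
    return [
        p["token_str"].replace(" ", "")
        for p in preds
        if p["score"] >= threshold
    ]


print(best_token(predictions))         # 使用
print(tokens_above(predictions, 0.1))  # ['使用', '利用']
```

The spaces inside token_str are removed here because the output above shows each candidate split into space-separated pieces.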

Tokenization and vectorization.

from transformers import BertModel
from nagisa_bert import NagisaBertTokenizer

text = "nagisaで[MASK]できるヒデルです"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
tokens = tokenizer.tokenize(text)
print(tokens)
# ['na', '##g', '##is', '##a', 'で', '[MASK]', 'できる', 'ヒデル', 'です']

model = BertModel.from_pretrained("taishi-i/nagisa_bert")
h = model(**tokenizer(text, return_tensors="pt")).last_hidden_state
print(h)
tensor([[[-0.2912, -0.6818, -0.4097,  ...,  0.0262, -0.3845,  0.5816],
         [ 0.2504,  0.2143,  0.5809,  ..., -0.5428,  1.1805,  1.8701],
         [ 0.1890, -0.5816, -0.5469,  ..., -1.2081, -0.2341,  1.0215],
         ...,
         [-0.4360, -0.2546, -0.2824,  ...,  0.7420, -0.2904,  0.3070],
         [-0.6598, -0.7607,  0.0034,  ...,  0.2982,  0.5126,  1.1403],
         [-0.2505, -0.6574, -0.0523,  ...,  0.9082,  0.5851,  1.2625]]],
       grad_fn=<NativeLayerNormBackward0>)

Model description

Architecture

The model uses the same architecture as BERT bert-base-uncased: 12 layers, 768-dimensional hidden states, and 12 attention heads.
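
To make these numbers concrete, here is a small sketch that records the three values and derives the per-head dimension, which is the hidden size divided by the number of attention heads. The class name BertBaseShape is hypothetical and not part of nagisa_bert. The field names are chosen to match the corresponding transformers.BertConfig parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertBaseShape:
    """Hypothetical container for the architecture values stated above."""

    num_hidden_layers: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12

    @property
    def head_dim(self) -> int:
        """Size of each attention head's slice of the hidden state."""
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads


shape = BertBaseShape()
print(shape.head_dim)  # 64
```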

Training Data

The model is trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia CirrusSearch dump file as of August 8, 2022 with make_corpus_wiki.py and create_pretraining_data.py.

Training

The model is trained with the default parameters of transformers.BertConfig. Due to GPU memory limitations, a small batch size is used: 16 instances per batch, for 2M training steps.
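
As a rough illustration of what this schedule amounts to, the sketch below multiplies the batch size by the number of steps to get the total number of training instances processed. The function name total_instances_seen is hypothetical, and the calculation assumes one batch of 16 per step with no gradient accumulation, which the description above does not state either way.

```python
BATCH_SIZE = 16             # instances per batch, as stated above
TRAINING_STEPS = 2_000_000  # 2M training steps, as stated above


def total_instances_seen(batch_size: int, steps: int) -> int:
    """Total instances processed, assuming one batch per step."""
    return batch_size * steps


print(f"{total_instances_seen(BATCH_SIZE, TRAINING_STEPS):,}")  # 32,000,000
```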

Tutorial

The following notebooks cover Japanese NLP with pre-trained models and transformers. Each can be opened in Colab.

| Notebook | Description |
| --- | --- |
| Fill-mask | How to use the pipeline function in transformers to fill in Japanese text. |
| Feature-extraction | How to use the pipeline function in transformers to extract features from Japanese text. |
| Embedding visualization | Show how to visualize embeddings from Japanese pre-trained models. |
| How to fine-tune a model on text classification | Show how to fine-tune a pretrained model on a Japanese text classification task. |
| How to fine-tune a model on text classification with csv files | Show how to preprocess the data and fine-tune a pretrained model on a Japanese text classification task. |