Japanese BERT-base (Nothing + BPE)
How to load the tokenizer
Please download the dictionary file for Nothing + BPE from our GitHub repository.
Then you can load the tokenizer by specifying the path of the dictionary file to dict_path
.
from typing import Optional
from tokenizers import Tokenizer, NormalizedString, PreTokenizedString
from tokenizers.processors import BertProcessing
from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast
# load a tokenizer
dict_path = /path/to/nothing_bpe.json
tokenizer = Tokenizer.from_file(dict_path)
tokenizer.post_processor = BertProcessing(
cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
)
# convert to PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer,
unk_token='[UNK]',
cls_token='[CLS]',
sep_token='[SEP]',
pad_token='[PAD]',
mask_token='[MASK]'
)
# Test
test_str = "こんにちは。私は形態素解析器について研究をしています。"
tokenizer.convert_ids_to_tokens(tokenizer(test_str).input_ids)
# -> ['[CLS]','こん','に','ち','は','。','私は','形態','素','解析','器','について','研究','を','してい','ます','。','[SEP]']
How to load the model
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base_nothing-bpe")
See our repository for more details!