Model description
- This model was trained on Chinese (ZH), Japanese (JA), and Korean (KO) Wikipedia for 5 epochs.
How to use
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
- No text segmentation is required before fine-tuning on downstream tasks.
- (You may still obtain better results if you apply morphological analysis to the data before fine-tuning.)
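As a quick check that the checkpoint loads, a masked-character prediction can be sketched as follows. This is a minimal example, not the author's evaluation code: the helper names `mask_character` and `predict_masked` are hypothetical, and it assumes the checkpoint works with the standard `fill-mask` pipeline from `transformers` and that the Hugging Face Hub is reachable.

```python
def mask_character(text, index, mask_token="[MASK]"):
    """Replace the single character at `index` with the mask token."""
    if not 0 <= index < len(text):
        raise IndexError("index out of range")
    return text[:index] + mask_token + text[index + 1:]


def predict_masked(text, index, top_k=5,
                   model_name="conan1024hao/cjkbert-small"):
    """Return (token, score) pairs predicted for the masked character.

    Hypothetical helper: the first call downloads the checkpoint.
    """
    # Imported lazily so mask_character stays usable without transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_name)
    # Use the tokenizer's own mask token rather than assuming "[MASK]".
    masked = mask_character(text, index, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(masked, top_k=top_k)]
```

For example, `predict_masked("東京は日本の首都です", 2)` masks the third character and returns up to five candidate tokens with their scores. Raw text is passed in directly, in line with the note above that no segmentation is needed.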
Morphological analysis tools
- ZH: We use LTP for Chinese.
- JA: We use Juman++ for Japanese.
- KO: We use KoNLPy (Kkma class) for Korean.
Tokenization
- We use character-based tokenization with a whole-word masking strategy.
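To illustrate how character-based tokens and whole-word masking fit together, here is a simplified sketch. It is not the training code for this model: the function names, the `[MASK]` string, and the masking ratio are assumptions, and it omits the usual BERT refinement of sometimes keeping or randomly replacing a selected token. The input is a sentence already segmented into words, for instance by one of the morphological analyzers listed above.

```python
import random

MASK = "[MASK]"  # assumed mask token, following the BERT convention


def to_char_tokens(words):
    """Split each word into characters and record which word each came from."""
    tokens, word_ids = [], []
    for i, word in enumerate(words):
        for ch in word:
            tokens.append(ch)
            word_ids.append(i)
    return tokens, word_ids


def whole_word_mask(words, mask_prob=0.15, seed=0):
    """Mask every character of a randomly chosen subset of words."""
    if not words:
        return [], set()
    rng = random.Random(seed)
    tokens, word_ids = to_char_tokens(words)
    # Select whole words, not individual characters; always mask at least one.
    n_to_mask = max(1, round(len(words) * mask_prob))
    chosen = set(rng.sample(range(len(words)), n_to_mask))
    masked = [MASK if wid in chosen else tok
              for tok, wid in zip(tokens, word_ids)]
    return masked, chosen
```

The point of the design is that a multi-character word such as 東京 is either fully visible or fully masked, so the model cannot recover a masked character from the other half of the same word.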
Model size
- vocab_size: 15015
- num_hidden_layers: 4
- hidden_size: 512
- num_attention_heads: 8
- param_num: 25M
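The parameter count can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes a standard BERT encoder and fills in three values that are not listed above with the `BertConfig` defaults from `transformers`: `intermediate_size=3072`, `max_position_embeddings=512`, and `type_vocab_size=2`. These are assumptions, not values confirmed by this card. The pooler and the masked-LM head are left out.

```python
def bert_param_estimate(vocab_size=15015, hidden_size=512, num_hidden_layers=4,
                        intermediate_size=3072,        # assumed (BertConfig default)
                        max_position_embeddings=512,   # assumed (BertConfig default)
                        type_vocab_size=2):            # assumed (BertConfig default)
    """Estimate embedding + encoder parameters of a BERT-style model."""
    h = hidden_size
    layer_norm = 2 * h  # scale and bias

    embeddings = (vocab_size * h
                  + max_position_embeddings * h
                  + type_vocab_size * h
                  + layer_norm)

    attention = 4 * (h * h + h)  # query, key, value, output projections
    feed_forward = (h * intermediate_size + intermediate_size
                    + intermediate_size * h + h)
    per_layer = attention + layer_norm + feed_forward + layer_norm

    return embeddings + num_hidden_layers * per_layer


total = bert_param_estimate()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes to roughly 24.8M, consistent with the 25M listed above. Note that `num_attention_heads` does not appear in the formula: it changes how the hidden dimension is split across heads, not the number of weights.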