# KoGPT2-small
| Model | Batch Size | Tokenizer | Vocab Size | Max Length | Parameter Size |
|-------|------------|-----------|------------|------------|----------------|
| GPT2  | 64         | BPE       | 30,000     | 1024       | 108M           |
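The values in the table can be sanity-checked directly from the released checkpoint. The snippet below is a minimal sketch, assuming the `Datascience-Lab/GPT2-small` repo id used in the inference example further down; it prints the vocabulary size, context length, and parameter count.

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the released checkpoint (same repo id as in the inference example below)
tokenizer = AutoTokenizer.from_pretrained('Datascience-Lab/GPT2-small')
model = GPT2LMHeadModel.from_pretrained('Datascience-Lab/GPT2-small')

print(len(tokenizer))                               # vocab size, expected ~30,000
print(model.config.n_positions)                     # max sequence length, expected 1024
print(sum(p.numel() for p in model.parameters()))   # parameter count, expected ~108M
```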
## DataSet
- AIhub - Korean corpus built from web data (4.8M)
- KoWiki dump 230701 (1.4M)
## Inference Example
```python
from transformers import AutoTokenizer, GPT2LMHeadModel

text = "출근이 힘들면"

tokenizer = AutoTokenizer.from_pretrained('Datascience-Lab/GPT2-small')
model = GPT2LMHeadModel.from_pretrained('Datascience-Lab/GPT2-small')

inputs = tokenizer.encode_plus(text, return_tensors='pt', add_special_tokens=False)

# temperature only takes effect when sampling is enabled, so do_sample=True is set
outputs = model.generate(inputs['input_ids'], max_length=128,
                         repetition_penalty=2.0,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.eos_token_id,
                         bos_token_id=tokenizer.bos_token_id,
                         use_cache=True,
                         do_sample=True,
                         temperature=0.5)

outputs = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(outputs)
```
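For quick experiments, the same generation can also be run through the `transformers` text-generation pipeline. This is a sketch under the same assumptions as above (the `Datascience-Lab/GPT2-small` repo id and the sampling settings from the example), not a required usage pattern.

```python
from transformers import pipeline

# Text-generation pipeline wrapping the same checkpoint
generator = pipeline('text-generation', model='Datascience-Lab/GPT2-small')

result = generator("출근이 힘들면", max_length=128,
                   do_sample=True, temperature=0.5, repetition_penalty=2.0)
print(result[0]['generated_text'])
```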