A 19th Century Realism Tuning of GPT-2
This model is a fine-tuned version of the GPT-2 model found here: https://huggingface.co/gpt2. The base model is introduced in the GPT-2 paper, "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019).
Model Description
To develop this model, the standard GPT-2 model was fine-tuned on four 19th century realism novels. GPT-2's pre-training is self-supervised: the model learns to predict the next word in a sequence. You can learn more about GPT-2 at the link in the previous section.
For this model, that same self-supervised training process was continued on the more confined domain of 19th century realism, yielding a more specialized model.
Hyperparameter | Value |
---|---|
Activation Function | GELU |
Number of Layers | 12 |
Number of Heads | 12 |
Embedding Size | 768 |
Max. Sequence Length | 1024 |
FFNN Dimensionality | 3072 |
Residual Dropout Probability | 0.1 |
Embedding Dropout Probability | 0.1 |
Attention Dropout Probability | 0.1 |
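The training script itself is not included in this card. The following is a minimal sketch of how such a causal language modeling fine-tune could be run with the Hugging Face Trainer; `corpus.txt` is a hypothetical file holding the four novels, and the batch size and epoch count are illustrative rather than the values actually used.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          GPT2Tokenizer, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; reuse EOS
model = AutoModelForCausalLM.from_pretrained('gpt2')

# 'corpus.txt' is a hypothetical path: one plain-text file with the four novels
dataset = load_dataset('text', data_files={'train': 'corpus.txt'})

def tokenize(batch):
    # Truncate each example to the model's 1024-token context window
    return tokenizer(batch['text'], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=['text'])

# mlm=False selects the causal (next-word prediction) objective used in pre-training
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir='finetuned_gpt2', logging_steps=500,
                         num_train_epochs=3, per_device_train_batch_size=2)  # illustrative
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized['train'], data_collator=collator)
trainer.train()
```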
Intended Uses and Limitations
This model is intended for text generation in the style of 19th century realist fiction.
How to Use
The model can be instantiated and used as follows.
```python
from transformers import AutoModelForCausalLM, GPT2Tokenizer

# Load the fine-tuned weights and the original GPT-2 tokenizer
model = AutoModelForCausalLM.from_pretrained('CharlieKincs/19th_century_gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; reuse EOS
tokenizer.padding_side = 'left'  # left-pad so generation continues from the prompt

# Tokenize a batch of prompts
inputs = tokenizer(['I am going to lunch and', 'I was watching television the other day and saw'], padding=True, return_tensors='pt')

# Sample up to 110 new tokens per prompt with top-k and nucleus sampling
outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask, max_new_tokens=110, do_sample=True, top_k=50, top_p=0.95, pad_token_id=tokenizer.eos_token_id)

# Convert the generated token IDs back to text
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
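With do_sample=True, each next token is drawn at random from the candidates that survive both filters: top_k=50 keeps only the 50 most probable tokens, and top_p=0.95 further truncates to the smallest set whose cumulative probability reaches 0.95. This yields more varied prose than greedy decoding, at the cost of reproducibility.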
Limitations
This model was trained on unfiltered text from 19th century novels, so its output may reflect the outdated social attitudes, stereotypes, and language of that era.
Training Data
This model was fine-tuned on the text from the following novels: The Idiot by Fyodor Dostoevsky (244,592 tokens), The Brothers Karamazov by Fyodor Dostoevsky (353,886 tokens), Anna Karenina by Leo Tolstoy (352,812 tokens), and David Copperfield by Charles Dickens (357,861 tokens). The total number of tokens in this corpus is 1,309,151. You can find the corpus at https://github.com/charliekincs/tuned_gpt2_corpus.
The corpus was collected from Project Gutenberg on February 21, 2023, using the curl command, with the goal of assembling a corpus that reflects the style of 19th century realism.
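The exact download commands are not reproduced here. The sketch below is a hypothetical Python equivalent of that curl-based collection, using the requests library; the Project Gutenberg ebook IDs and URL pattern are assumptions and should be verified against the actual catalog.

```python
import requests

# Hypothetical Project Gutenberg plain-text URLs for the four novels;
# the ebook IDs are illustrative and must be checked before use.
BOOKS = {
    'the_idiot.txt': 'https://www.gutenberg.org/files/2638/2638-0.txt',
    'brothers_karamazov.txt': 'https://www.gutenberg.org/files/28054/28054-0.txt',
    'anna_karenina.txt': 'https://www.gutenberg.org/files/1399/1399-0.txt',
    'david_copperfield.txt': 'https://www.gutenberg.org/files/766/766-0.txt',
}

for filename, url in BOOKS.items():
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on a bad download
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(response.text)
```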
The text was pre-processed as described on the GPT-2 page (linked in the first section), using the GPT2Tokenizer class.
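As a small illustration of what that tokenization looks like, the sketch below shows GPT2Tokenizer applying GPT-2's byte-level BPE to a sample sentence (the sentence is invented for illustration, not drawn from the corpus):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

ids = tokenizer('Prince Myshkin arrived in Petersburg.').input_ids
print(tokenizer.convert_ids_to_tokens(ids))
# Prints the subword pieces; the 'Ġ' prefix marks a token that begins with a space
```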
Evaluation
The following table shows the training loss reported every 500 steps. Loss fell from about 4.04 at step 500 to about 3.09 by step 53,500; the sharp drops near steps 18,500 and 36,500 most likely mark epoch boundaries.
Step | Training Loss |
---|---|
500 | 4.039800 |
1000 | 3.867400 |
1500 | 3.805600 |
2000 | 3.746300 |
2500 | 3.706000 |
3000 | 3.673700 |
3500 | 3.676600 |
4000 | 3.634000 |
4500 | 3.651300 |
5000 | 3.651000 |
5500 | 3.611800 |
6000 | 3.595200 |
6500 | 3.603700 |
7000 | 3.588500 |
7500 | 3.568300 |
8000 | 3.577000 |
8500 | 3.571200 |
9000 | 3.555600 |
9500 | 3.561500 |
10000 | 3.545800 |
10500 | 3.574700 |
11000 | 3.515000 |
11500 | 3.550900 |
12000 | 3.504600 |
12500 | 3.513500 |
13000 | 3.505500 |
13500 | 3.518900 |
14000 | 3.477200 |
14500 | 3.506300 |
15000 | 3.488400 |
15500 | 3.515900 |
16000 | 3.516000 |
16500 | 3.503700 |
17000 | 3.468400 |
17500 | 3.485000 |
18000 | 3.486700 |
18500 | 3.278000 |
19000 | 3.285900 |
19500 | 3.253900 |
20000 | 3.289800 |
20500 | 3.251700 |
21000 | 3.289900 |
21500 | 3.291400 |
22000 | 3.267700 |
22500 | 3.275400 |
23000 | 3.278800 |
23500 | 3.251600 |
24000 | 3.286700 |
24500 | 3.259500 |
25000 | 3.295000 |
25500 | 3.271200 |
26000 | 3.290600 |
26500 | 3.274200 |
27000 | 3.307400 |
27500 | 3.273600 |
28000 | 3.284300 |
28500 | 3.272800 |
29000 | 3.284500 |
29500 | 3.292500 |
30000 | 3.268700 |
30500 | 3.261500 |
31000 | 3.263300 |
31500 | 3.243900 |
32000 | 3.266300 |
32500 | 3.264600 |
33000 | 3.265800 |
33500 | 3.260700 |
34000 | 3.283200 |
34500 | 3.292100 |
35000 | 3.280100 |
35500 | 3.286100 |
36000 | 3.236600 |
36500 | 3.078300 |
37000 | 3.079400 |
37500 | 3.105900 |
38000 | 3.109900 |
38500 | 3.105100 |
39000 | 3.097400 |
39500 | 3.093500 |
40000 | 3.116700 |
40500 | 3.128900 |
41000 | 3.123000 |
41500 | 3.112500 |
42000 | 3.106000 |
42500 | 3.118700 |
43000 | 3.110200 |
43500 | 3.099000 |
44000 | 3.091400 |
44500 | 3.129300 |
45000 | 3.123600 |
45500 | 3.122000 |
46000 | 3.092900 |
46500 | 3.120100 |
47000 | 3.120900 |
47500 | 3.113600 |
48000 | 3.109800 |
48500 | 3.097600 |
49000 | 3.109300 |
49500 | 3.122900 |
50000 | 3.131400 |
50500 | 3.129000 |
51000 | 3.097700 |
51500 | 3.114600 |
52000 | 3.124800 |
52500 | 3.121900 |
53000 | 3.080900 |
53500 | 3.088000 |
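Since the Trainer reports the mean cross-entropy loss in nats per token, training loss converts to perplexity as PPL = exp(loss): the final reported loss of 3.088 corresponds to a training perplexity of roughly exp(3.088) ≈ 22. No held-out evaluation set was used, so these figures reflect fit to the training corpus only.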