A 19th-Century Realism Tuning of GPT-2

This model is a fine-tuned version of the GPT-2 model found at https://huggingface.co/gpt2. GPT-2 itself is introduced in the paper "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019).

Model Description

To develop this model, the standard GPT-2 model was fine-tuned on four 19th-century realist novels. GPT-2's pre-training is self-supervised: the model learns to predict the next token in a sequence. You can learn more about GPT-2 at the link in the previous section.

For this model, that training process was continued on the narrower domain of 19th-century realism to produce a more specialized model. The fine-tuning used the same next-token prediction objective as pre-training.
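
As a rough illustration only (not the exact script used to produce this model), a fine-tuning run of this kind can be set up with the Hugging Face Trainer. The corpus path, epoch count, batch size, and block size below are placeholders, not values taken from this model's training run.

from transformers import (AutoModelForCausalLM, GPT2Tokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling,
                          TextDataset)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

# Chunk a (hypothetical) concatenated corpus file into fixed-length blocks of tokens.
train_dataset = TextDataset(tokenizer=tokenizer, file_path='corpus.txt', block_size=128)
# mlm=False selects the standard next-token (causal) language-modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir='19th_century_gpt2', num_train_epochs=3,
                         per_device_train_batch_size=8, logging_steps=500)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
                  data_collator=collator)
trainer.train()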

Hyperparameters

Activation function: GELU
Number of layers: 12
Number of attention heads: 12
Embedding size: 768
Max. sequence length: 1024
FFNN dimensionality: 3072
Residual dropout probability: 0.1
Embedding dropout probability: 0.1
Attention dropout probability: 0.1
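
These are the standard settings of the GPT-2 (small) architecture, which the fine-tuned checkpoint inherits. Assuming the uploaded checkpoint keeps that configuration, the values can be confirmed by inspecting the model config, for example:

from transformers import AutoConfig

config = AutoConfig.from_pretrained('CharlieKincs/19th_century_gpt2')
# Layer, head, embedding, and sequence-length settings.
print(config.n_layer, config.n_head, config.n_embd, config.n_positions)
# Activation function ('gelu_new' in the default GPT-2 config) and dropout probabilities.
print(config.activation_function, config.resid_pdrop, config.embd_pdrop, config.attn_pdrop)
# n_inner (the FFNN dimensionality) defaults to 4 * n_embd = 3072 when left unset.
print(config.n_inner)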

Intended Uses and Limitations

This model is intended to be used for text generation, particularly text that imitates the style of 19th-century realist fiction.

How to Use

The model can be instantiated and used as follows.

from transformers import AutoModelForCausalLM, GPT2Tokenizer

# Load the fine-tuned model and the standard GPT-2 tokenizer.
model = AutoModelForCausalLM.from_pretrained('CharlieKincs/19th_century_gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no pad token, so reuse the end-of-sequence token and pad on the left
# so that batched generation continues from the real end of each prompt.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'

inputs = tokenizer(['I am going to lunch and', 'I was watching television the other day and saw'],
                   padding=True, return_tensors='pt')
# Sample up to 110 new tokens per prompt.
outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask,
                         max_new_tokens=110, do_sample=True, top_k=50, top_p=0.95,
                         pad_token_id=tokenizer.eos_token_id)
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
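
The decoded continuations can then be printed, one per prompt:

for text in decoded_outputs:
    print(text)
    print('---')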

Limitations

This model was trained on unfiltered text from 19th-century novels, so generated output may reflect outdated language, attitudes, and biases that are not appropriate today.

Training Data

This model was fine-tuned on the text of the following novels: The Idiot by Fyodor Dostoevsky (244,592 tokens), The Brothers Karamazov by Fyodor Dostoevsky (353,886 tokens), Anna Karenina by Leo Tolstoy (352,812 tokens), and David Copperfield by Charles Dickens (357,861 tokens), for a total of 1,309,151 tokens. You can find the corpus at https://github.com/charliekincs/tuned_gpt2_corpus.

The corpus was collected from Project Gutenberg on February 21, 2023, using the curl command. The goal was to assemble a corpus that reflects the style of 19th-century realism.

The text was pre-processed as described on the GPT-2 page (linked in the first section), using the GPT2Tokenizer class.
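
As an illustration of that step (the file names below are placeholders, not the repository's actual layout), per-novel token counts like those reported above can be computed directly with GPT2Tokenizer:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
for path in ['the_idiot.txt', 'brothers_karamazov.txt', 'anna_karenina.txt', 'david_copperfield.txt']:
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # The tokenizer warns that the sequence exceeds the 1024-token model limit;
    # that is fine here because we only want the total BPE token count.
    token_ids = tokenizer(text).input_ids
    print(path, len(token_ids))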

Evaluation

The following table shows the training loss reported every 500 steps during fine-tuning:

Step Training Loss
500 4.039800
1000 3.867400
1500 3.805600
2000 3.746300
2500 3.706000
3000 3.673700
3500 3.676600
4000 3.634000
4500 3.651300
5000 3.651000
5500 3.611800
6000 3.595200
6500 3.603700
7000 3.588500
7500 3.568300
8000 3.577000
8500 3.571200
9000 3.555600
9500 3.561500
10000 3.545800
10500 3.574700
11000 3.515000
11500 3.550900
12000 3.504600
12500 3.513500
13000 3.505500
13500 3.518900
14000 3.477200
14500 3.506300
15000 3.488400
15500 3.515900
16000 3.516000
16500 3.503700
17000 3.468400
17500 3.485000
18000 3.486700
18500 3.278000
19000 3.285900
19500 3.253900
20000 3.289800
20500 3.251700
21000 3.289900
21500 3.291400
22000 3.267700
22500 3.275400
23000 3.278800
23500 3.251600
24000 3.286700
24500 3.259500
25000 3.295000
25500 3.271200
26000 3.290600
26500 3.274200
27000 3.307400
27500 3.273600
28000 3.284300
28500 3.272800
29000 3.284500
29500 3.292500
30000 3.268700
30500 3.261500
31000 3.263300
31500 3.243900
32000 3.266300
32500 3.264600
33000 3.265800
33500 3.260700
34000 3.283200
34500 3.292100
35000 3.280100
35500 3.286100
36000 3.236600
36500 3.078300
37000 3.079400
37500 3.105900
38000 3.109900
38500 3.105100
39000 3.097400
39500 3.093500
40000 3.116700
40500 3.128900
41000 3.123000
41500 3.112500
42000 3.106000
42500 3.118700
43000 3.110200
43500 3.099000
44000 3.091400
44500 3.129300
45000 3.123600
45500 3.122000
46000 3.092900
46500 3.120100
47000 3.120900
47500 3.113600
48000 3.109800
48500 3.097600
49000 3.109300
49500 3.122900
50000 3.131400
50500 3.129000
51000 3.097700
51500 3.114600
52000 3.124800
52500 3.121900
53000 3.080900
53500 3.088000
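
For reference, here is a minimal matplotlib sketch (not part of the original training code) for visualizing this loss curve; only a few representative (step, loss) pairs from the table are hard-coded below.

import matplotlib.pyplot as plt

# A few representative (step, loss) pairs copied from the table above.
history = [(500, 4.0398), (9000, 3.5556), (18000, 3.4867), (18500, 3.2780),
           (27000, 3.3074), (36000, 3.2366), (36500, 3.0783), (45000, 3.1236),
           (53500, 3.0880)]
steps, losses = zip(*history)

plt.plot(steps, losses, marker='o')
plt.xlabel('Training step')
plt.ylabel('Training loss')
plt.title('Fine-tuning loss (reported every 500 steps)')
plt.show()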