# charles-dickens-gpt2

This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on a dataset created from Charles Dickens's books. It achieves the following results on the evaluation set:
- Loss: 3.2286
## Model description

The GPT-2 model is fine-tuned on a text corpus drawn from Charles Dickens's books.
## Intended uses & limitations

The model generates text in Charles Dickens's style. One limitation is that the generated text does not always follow a chronological order. Another stems from preprocessing: because the corpus was split into sentences at every full stop, sentences containing honorifics were truncated prematurely, for example at the period in "Mr.".
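As an illustration of the intended use, the snippet below sketches how text could be generated with the fine-tuned checkpoint using the `transformers` text-generation pipeline. The model identifier `charles-dickens-gpt2` is a placeholder for the actual Hub repository id or local path of the checkpoint, and the prompt and sampling settings are assumptions, not values from this card.

```python
from transformers import pipeline

# Placeholder model id; replace with the actual Hub repo id or local checkpoint path.
generator = pipeline("text-generation", model="charles-dickens-gpt2")

prompt = "It was a cold, foggy evening in London, and"
outputs = generator(
    prompt,
    max_new_tokens=60,        # length of the continuation
    do_sample=True,           # sample to get more varied, Dickens-flavoured text
    top_p=0.95,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
```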
## List of books included in the text corpus
- A Christmas Carol
- A Tale of Two Cities
- David Copperfield
- Great Expectations
- Hard Times
- Hunted Down
- Oliver Twist Vol 1 of 3
- Oliver Twist
- The Magic Fishbone
## Number of tokens in each book and in total
- Total number of tokens in the corpus: 1029751
- A Christmas Carol: 28691
- A Tale of Two Cities: 135641
- David Copperfield: 353905
- Great Expectations: 184350
- Hard Times: 102939
- Hunted Down: 8627
- Oliver Twist Vol 1 of 3: 54622
- Oliver Twist: 157172
- The Magic Fishbone: 3805
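The card does not state how these counts were computed, so the sketch below is only an assumption: it counts GPT-2 BPE tokens per book with the GPT-2 tokenizer, with the directory and file names as placeholders.

```python
from pathlib import Path
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Placeholder directory containing one plain-text file per book.
corpus_dir = Path("books")

total = 0
for book in sorted(corpus_dir.glob("*.txt")):
    text = book.read_text(encoding="utf-8")
    n_tokens = len(tokenizer(text)["input_ids"])
    total += n_tokens
    print(f"{book.stem}: {n_tokens}")
print(f"Total: {total}")
```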
## Dataset

The dataset comprises nine books by Charles Dickens. The books were downloaded in plain-text format from the Project Gutenberg website between 18 and 24 February 2023. The data was collected to generate text in Charles Dickens's style. Links to the dataset: https://github.umn.edu/tasni008/charles-dickens-gpt2 and https://drive.google.com/drive/folders/1f0R69L9jltXJaRcHJmcSkgMjnKk5VpRr?usp=sharing
## Text Preprocessing

Newlines were replaced with whitespace, and the text was divided into shorter sentences at the punctuation marks ?, ., ;, and !. Sentences with a length greater than 200 were discarded, and sentences with lengths between 100 and 200 were split once more at commas. The Project Gutenberg notes and boilerplate at the beginning and end of each book were removed manually.
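A minimal sketch of the preprocessing described above, assuming lengths are measured in characters and that each punctuation mark stays attached to the sentence it ends; the exact implementation used to build the dataset may differ.

```python
import re
from typing import List


def preprocess(raw_text: str) -> List[str]:
    # Replace newlines with whitespace.
    text = raw_text.replace("\n", " ")

    # Split into sentences at ?, ., ; and ! (the punctuation stays with its sentence).
    sentences = [s.strip() for s in re.split(r"(?<=[?.;!])\s+", text) if s.strip()]

    processed = []
    for sentence in sentences:
        if len(sentence) > 200:
            # Discard overly long sentences.
            continue
        if len(sentence) > 100:
            # Split once more at commas and keep the shorter pieces.
            processed.extend(p.strip() for p in sentence.split(",") if p.strip())
        else:
            processed.append(sentence)
    return processed
```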
## Training and evaluation data

The data is split into training and test sets, with 20% held out as test data. The training set contains 88981 tokens and the test set contains 22246 tokens.
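One way to obtain such a split with the `datasets` library is sketched below. The 20% test fraction comes from the card; the file name is a placeholder, and the seed actually used for the split is not stated.

```python
from datasets import load_dataset

# Placeholder file containing one preprocessed sentence per line.
dataset = load_dataset("text", data_files="dickens_sentences.txt")["train"]

# 80/20 train/test split; the split seed is an assumption.
split = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = split["train"], split["test"]
print(train_ds.num_rows, test_ds.num_rows)
```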
## Training procedure

The model was trained with the Trainer API from the Hugging Face Transformers library; a configuration sketch follows the hyperparameter list below.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 4
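The following is a sketch of how the Trainer could be configured with the hyperparameters listed above, assuming a causal language-modelling setup with the GPT-2 tokenizer and a single device (so the listed batch sizes map to the per-device arguments). The dataset variables `train_ds`/`test_ds` (assumed to be already tokenized), the output directory, and the logging settings are assumptions, not details from the card; the evaluation interval of 50 steps matches the results table below.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

args = TrainingArguments(
    output_dir="charles-dickens-gpt2",   # assumed output directory
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=4,
    evaluation_strategy="steps",
    eval_steps=50,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,              # assumed tokenized training split
    eval_dataset=test_ds,                # assumed tokenized test split
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```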
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 4.5557        | 0.19  | 50   | 4.1287          |
| 4.0166        | 0.39  | 100  | 3.6971          |
| 3.7114        | 0.58  | 150  | 3.5248          |
| 3.5769        | 0.77  | 200  | 3.4414          |
| 3.4964        | 0.97  | 250  | 3.3912          |
| 3.4327        | 1.16  | 300  | 3.3578          |
| 3.3962        | 1.35  | 350  | 3.3368          |
| 3.3791        | 1.54  | 400  | 3.3164          |
| 3.3573        | 1.74  | 450  | 3.2998          |
| 3.3419        | 1.93  | 500  | 3.2851          |
| 3.294         | 2.12  | 550  | 3.2762          |
| 3.2767        | 2.32  | 600  | 3.2665          |
| 3.2534        | 2.51  | 650  | 3.2563          |
| 3.2607        | 2.7   | 700  | 3.2471          |
| 3.2593        | 2.9   | 750  | 3.2401          |
| 3.2224        | 3.09  | 800  | 3.2409          |
| 3.1909        | 3.28  | 850  | 3.2358          |
| 3.192         | 3.47  | 900  | 3.2330          |
| 3.2001        | 3.67  | 950  | 3.2301          |
| 3.199         | 3.86  | 1000 | 3.2286          |
### Framework versions
- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Tokenizers 0.13.2