# gpt2-finetuned-agatha-christie
This model is a fine-tuned version of gpt2 on a text dataset of Agatha Christie's books. It achieves the following results on the evaluation set:
- Loss: 3.0911
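Assuming the reported loss is the usual mean token-level cross-entropy (the Trainer default for causal language modeling), it corresponds to a perplexity of roughly exp(3.0911) ≈ 22:

```python
import math

# Perplexity is exp(mean cross-entropy loss), assuming the loss above
# is the standard token-level cross-entropy reported by the Trainer.
eval_loss = 3.0911
perplexity = math.exp(eval_loss)
print(round(perplexity, 1))  # prints 22.0
```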
## Model description
This is a GPT-2 model fine-tuned on a text corpus drawn from Agatha Christie's books. GPT-2 is a transformer model pre-trained on a very large corpus of English data in a self-supervised fashion.
## Intended uses & limitations
The intended use of this model is to generate text in the style of Agatha Christie, the queen of crime.
Although Ms. Christie wrote around 80 original works, not all of them could be included because of copyright restrictions. Because the model is fine-tuned on a small dataset, the generated text may at times be of limited quality.
## Training and evaluation data
A custom text corpus was built for training and validation. Thirteen of Ms. Christie's original works that are in the public domain were chosen. Raw texts were downloaded from https://www.gutenberg.org/ and other available sources between February 15th and 20th, 2023.
Data preprocessing:
- The Project Gutenberg header and footer texts were removed manually.
- Text illustrations were identified and removed.
- All newlines were stripped.
- Special characters such as `=` and curly quotes (“”) were removed.
- Sentences longer than 200 characters were removed.
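The cleaning steps above can be sketched as follows. This is an illustrative stand-in, not the exact scripts used: in particular, the header/footer removal was done manually, so the marker-based strip and the naive sentence split below are assumptions.

```python
import re

def clean_text(raw: str) -> str:
    """Apply the cleaning steps described above to one raw Gutenberg text."""
    # Strip the Project Gutenberg header/footer around the book body
    # (done manually in the actual pipeline; markers are a stand-in).
    start = raw.find("*** START OF")
    end = raw.find("*** END OF")
    if start != -1 and end != -1:
        raw = raw[raw.index("\n", start) + 1:end]

    # Remove bracketed illustration tags such as "[Illustration: ...]".
    raw = re.sub(r"\[Illustration[^\]]*\]", "", raw)

    # Strip all newlines, collapsing runs of whitespace to single spaces.
    raw = " ".join(raw.split())

    # Remove special characters like '=' and curly quotes.
    raw = re.sub(r'[=“”]', "", raw)

    # Drop sentences longer than 200 characters (naive punctuation split).
    sentences = re.split(r"(?<=[.!?])\s+", raw)
    return " ".join(s for s in sentences if len(s) <= 200)
```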
Here is the list of books after data cleaning and preprocessing:
| File Name | Word Count | Character Count | Book Type |
|---|---|---|---|
| The_Murder_on_the_Links.txt | 64470 | 383665 | Novel |
| The_Mysterious_Affair_at_Styles.txt | 56456 | 341202 | Novel |
| The_Secret_of_Chimneys.txt | 74431 | 455894 | Novel |
| And_Then_There_Were_None.txt | 52607 | 320398 | Novel |
| The_murder_of_Roger_Ackroyd.txt | 69485 | 416920 | Novel |
| Poirot_Investigates.txt | 52494 | 313466 | Novel |
| The_Big_Four.txt | 55230 | 319360 | Novel |
| The_Mystery_of_the_Blue_Train.txt | 71222 | 414922 | Novel |
| The_Secret_Adversary.txt | 10855 | 75138 | Novel |
| The_Man_in_the_Brown_Suit.txt | 10317 | 75261 | Novel |
| The_Hunters_Lodge_Case.txt | 4352 | 25602 | Short Story |
| The_Missing_Will.txt | 3257 | 19004 | Short Story |
| The_Plymouth_Express_Affair.txt | 4858 | 29493 | Short Story |
| Total | 659261 | 3928209 | |
### Splitting training and evaluation data
The nltk sentence tokenizer was used to split all texts into sentences. Then scikit-learn's `train_test_split` method was used to place a random 85% of the sentences in the training set and the remaining 15% in the evaluation set. In the training and evaluation files, each sentence is placed on its own line.
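The split can be sketched like this; the placeholder sentence list stands in for the output of nltk's `sent_tokenize` on the cleaned book texts, and the file names are illustrative:

```python
from sklearn.model_selection import train_test_split

# Placeholder for the sentences produced by nltk.tokenize.sent_tokenize
# applied to the cleaned corpus.
sentences = [f"Sentence number {i}." for i in range(1000)]

# 85% training / 15% evaluation, shuffled with a fixed seed.
train_sents, eval_sents = train_test_split(
    sentences, train_size=0.85, random_state=42, shuffle=True
)

# Each sentence goes on its own line in the output files.
with open("train.txt", "w") as f:
    f.write("\n".join(train_sents))
with open("eval.txt", "w") as f:
    f.write("\n".join(eval_sents))

print(len(train_sents), len(eval_sents))  # prints 850 150
```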
During training and validation, the default GPT-2 tokenizer is used.
## Training procedure
The `Trainer` class from the transformers library was used to fine-tune the model.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 6
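The hyperparameters above map onto a `Trainer` setup roughly as follows. This is a configuration sketch, not the exact training script: the output directory, evaluation interval, and dataset wiring are assumptions, and the datasets themselves (the tokenized 85%/15% sentence splits) are omitted.

```python
from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hyperparameters from this card; the Adam betas/epsilon and linear
# schedule listed above are the Trainer defaults.
args = TrainingArguments(
    output_dir="gpt2-finetuned-agatha-christie",  # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=6,
    evaluation_strategy="steps",
    eval_steps=50,  # matches the 50-step cadence in the results table
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    # train_dataset=..., eval_dataset=...,  # tokenized sentence splits
)
# trainer.train()
```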
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 4.2824 | 0.26 | 50 | 3.8764 |
| 3.8824 | 0.51 | 100 | 3.5931 |
| 3.6336 | 0.77 | 150 | 3.4378 |
| 3.5056 | 1.03 | 200 | 3.3445 |
| 3.4038 | 1.28 | 250 | 3.2881 |
| 3.3502 | 1.54 | 300 | 3.2506 |
| 3.3135 | 1.79 | 350 | 3.2224 |
| 3.2839 | 2.05 | 400 | 3.2028 |
| 3.2193 | 2.31 | 450 | 3.1816 |
| 3.2066 | 2.56 | 500 | 3.1660 |
| 3.2043 | 2.82 | 550 | 3.1470 |
| 3.1619 | 3.08 | 600 | 3.1380 |
| 3.1092 | 3.33 | 650 | 3.1271 |
| 3.1073 | 3.59 | 700 | 3.1187 |
| 3.099 | 3.85 | 750 | 3.1109 |
| 3.0695 | 4.1 | 800 | 3.1089 |
| 3.0281 | 4.36 | 850 | 3.1044 |
| 3.0322 | 4.62 | 900 | 3.1002 |
| 3.0358 | 4.87 | 950 | 3.0944 |
| 3.0126 | 5.13 | 1000 | 3.0958 |
| 2.9889 | 5.38 | 1050 | 3.0931 |
| 2.9874 | 5.64 | 1100 | 3.0917 |
| 2.9915 | 5.9 | 1150 | 3.0911 |
### Framework versions
- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Datasets 2.10.0
- Tokenizers 0.13.2