# gpt2-finetuned-agatha-christie
This model is a fine-tuned version of gpt2 on a text dataset of Agatha Christie's books. It achieves the following results on the evaluation set:
- Loss: 3.0911
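Assuming the reported loss is the usual mean token-level cross-entropy (the Trainer default for causal language modeling), it corresponds to a perplexity of roughly exp(3.0911) ≈ 22:

```python
import math

# Perplexity is exp(mean cross-entropy loss), assuming the loss above
# is the standard token-level cross-entropy reported by the Trainer.
eval_loss = 3.0911
perplexity = math.exp(eval_loss)
print(round(perplexity, 1))  # prints 22.0
```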
## Model description
This is a GPT-2 model fine-tuned on a text corpus drawn from Agatha Christie's books. GPT-2 is a transformer model pre-trained on a very large corpus of English data in a self-supervised fashion.
## Intended uses & limitations
The intended use of this model is to generate text in the style of Agatha Christie, the queen of crime.
Although Ms. Christie wrote around 80 original works, not all of them could be included because of copyright restrictions. Because the model is fine-tuned on a small dataset, the generated text may at times be of limited quality.
## Training and evaluation data
A custom text corpus was built for training and validation. Thirteen of Ms. Christie's original works that are in the public domain were chosen. Raw texts were downloaded from https://www.gutenberg.org/ and other available sources between February 15th and 20th, 2023.
Data preprocessing:
- The Project Gutenberg header and footer texts were removed manually.
- Text illustrations were identified and removed.
- All newlines were stripped.
- Special characters such as `=` and curly quotes (“”) were removed.
- Sentences longer than 200 characters were removed.
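The cleaning steps above can be sketched as follows. This is an illustrative stand-in, not the exact scripts used: in particular, the header/footer removal was done manually, so the marker-based strip and the naive sentence split below are assumptions.

```python
import re

def clean_text(raw: str) -> str:
    """Apply the cleaning steps described above to one raw Gutenberg text."""
    # Strip the Project Gutenberg header/footer around the book body
    # (done manually in the actual pipeline; markers are a stand-in).
    start = raw.find("*** START OF")
    end = raw.find("*** END OF")
    if start != -1 and end != -1:
        raw = raw[raw.index("\n", start) + 1:end]

    # Remove bracketed illustration tags such as "[Illustration: ...]".
    raw = re.sub(r"\[Illustration[^\]]*\]", "", raw)

    # Strip all newlines, collapsing runs of whitespace to single spaces.
    raw = " ".join(raw.split())

    # Remove special characters like '=' and curly quotes.
    raw = re.sub(r'[=“”]', "", raw)

    # Drop sentences longer than 200 characters (naive punctuation split).
    sentences = re.split(r"(?<=[.!?])\s+", raw)
    return " ".join(s for s in sentences if len(s) <= 200)
```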
Here is the list of books after data cleaning and preprocessing:
| File Name | Word Count | Character Count | Book Type |
|---|---|---|---|
| The_Murder_on_the_Links.txt | 64470 | 383665 | Novel |
| The_Mysterious_Affair_at_Styles.txt | 56456 | 341202 | Novel |
| The_Secret_of_Chimneys.txt | 74431 | 455894 | Novel |
| And_Then_There_Were_None.txt | 52607 | 320398 | Novel |
| The_murder_of_Roger_Ackroyd.txt | 69485 | 416920 | Novel |
| Poirot_Investigates.txt | 52494 | 313466 | Novel |
| The_Big_Four.txt | 55230 | 319360 | Novel |
| The_Mystery_of_the_Blue_Train.txt | 71222 | 414922 | Novel |
| The_Secret_Adversary.txt | 10855 | 75138 | Novel |
| The_Man_in_the_Brown_Suit.txt | 10317 | 75261 | Novel |
| The_Hunters_Lodge_Case.txt | 4352 | 25602 | Short Story |
| The_Missing_Will.txt | 3257 | 19004 | Short Story |
| The_Plymouth_Express_Affair.txt | 4858 | 29493 | Short Story |
| Total | 659261 | 3928209 | |
### Splitting training and evaluation data
The nltk sentence tokenizer was used to split all texts into sentences. Then scikit-learn's `train_test_split` method was used to place a random 85% of the sentences in the training set and the remaining 15% in the evaluation set. In the training and evaluation files, each sentence is placed on its own line.
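The split can be sketched like this; the placeholder sentence list stands in for the output of nltk's `sent_tokenize` on the cleaned book texts, and the file names are illustrative:

```python
from sklearn.model_selection import train_test_split

# Placeholder for the sentences produced by nltk.tokenize.sent_tokenize
# applied to the cleaned corpus.
sentences = [f"Sentence number {i}." for i in range(1000)]

# 85% training / 15% evaluation, shuffled with a fixed seed.
train_sents, eval_sents = train_test_split(
    sentences, train_size=0.85, random_state=42, shuffle=True
)

# Each sentence goes on its own line in the output files.
with open("train.txt", "w") as f:
    f.write("\n".join(train_sents))
with open("eval.txt", "w") as f:
    f.write("\n".join(eval_sents))

print(len(train_sents), len(eval_sents))  # prints 850 150
```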
During training and validation, the default GPT-2 tokenizer is used.
## Training procedure
The `Trainer` class from the transformers library was used to fine-tune the model.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 6
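The hyperparameters above map onto a `Trainer` setup roughly as follows. This is a configuration sketch, not the exact training script: the output directory, evaluation interval, and dataset wiring are assumptions, and the datasets themselves (the tokenized 85%/15% sentence splits) are omitted.

```python
from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hyperparameters from this card; the Adam betas/epsilon and linear
# schedule listed above are the Trainer defaults.
args = TrainingArguments(
    output_dir="gpt2-finetuned-agatha-christie",  # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=6,
    evaluation_strategy="steps",
    eval_steps=50,  # matches the 50-step cadence in the results table
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    # train_dataset=..., eval_dataset=...,  # tokenized sentence splits
)
# trainer.train()
```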
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 4.2824 | 0.26 | 50 | 3.8764 |
| 3.8824 | 0.51 | 100 | 3.5931 |
| 3.6336 | 0.77 | 150 | 3.4378 |
| 3.5056 | 1.03 | 200 | 3.3445 |
| 3.4038 | 1.28 | 250 | 3.2881 |
| 3.3502 | 1.54 | 300 | 3.2506 |
| 3.3135 | 1.79 | 350 | 3.2224 |
| 3.2839 | 2.05 | 400 | 3.2028 |
| 3.2193 | 2.31 | 450 | 3.1816 |
| 3.2066 | 2.56 | 500 | 3.1660 |
| 3.2043 | 2.82 | 550 | 3.1470 |
| 3.1619 | 3.08 | 600 | 3.1380 |
| 3.1092 | 3.33 | 650 | 3.1271 |
| 3.1073 | 3.59 | 700 | 3.1187 |
| 3.099 | 3.85 | 750 | 3.1109 |
| 3.0695 | 4.1 | 800 | 3.1089 |
| 3.0281 | 4.36 | 850 | 3.1044 |
| 3.0322 | 4.62 | 900 | 3.1002 |
| 3.0358 | 4.87 | 950 | 3.0944 |
| 3.0126 | 5.13 | 1000 | 3.0958 |
| 2.9889 | 5.38 | 1050 | 3.0931 |
| 2.9874 | 5.64 | 1100 | 3.0917 |
| 2.9915 | 5.9 | 1150 | 3.0911 |
### Framework versions
- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Datasets 2.10.0
- Tokenizers 0.13.2